Commonly observed patterns typically follow a few distinct families of probability distributions. 
Over one hundred years ago, Karl Pearson provided a systematic derivation and classification of 
the common continuous distributions. His approach was phenomenological: a differential equation 
that generated common distributions without any underlying conceptual basis for why common 
distributions have particular forms and what explains the familial relations. Pearson's system and 
its descendants remain the most popular systematic classification of probability distributions. Here, 
we unify the disparate forms of common distributions into a single system based on two meaningful 
and justifiable propositions. First, distributions follow maximum entropy subject to constraints, 
where maximum entropy is equivalent to minimum information. Second, different problems asso- 
ciate magnitude to information in different ways, an association we describe in terms of the relation 
between information invariance and measurement scale. Our framework relates the different contin- 
uous probability distributions through the variations in measurement scale that change each family 
of maximum entropy distributions into a distinct family. 



I. INTRODUCTION 

Commonly observed patterns follow a few families of 
probability distributions. For example, Gaussian pat- 
terns often arise from measures of height or weight, and 
gamma patterns often arise from measures of waiting 
times. These common patterns lead to two questions. 
How are the different families of distributions related? 
Why are there so few families, when the possible pat- 
terns are essentially infinite? 

These questions are important, because one can hardly 
begin to study nature without some sense of the funda- 
mental contours of pattern and why those contours arise. 
For example, no one observing a Gaussian distribution 
of weights in a population would feel a need to give a 
special explanation for that pattern. The central limit 
theorem tells us that a Gaussian distribution is a natural 
and widely expected pattern that arises from measuring 
aggregates in a certain way. 

With other common patterns, such as power laws, the 
current standard of interpretation is much more vari- 
able. That variability arises because we do not have a 
comprehensive theory of how measurement and informa- 
tion shape the commonly observed patterns. Without a 
clear notion of what is expected in different situations, 
common and relatively uninformative patterns frequently 
motivate unnecessarily complex explanations, and sur- 
prising and informative patterns may be overlooked [3]. 

Currently, the differences between families of common 
probability distributions often seem arbitrary. Thus, lit- 
tle understanding exists with regard to how changes in 
process or in methods of observation may cause observed 
pattern to change from one common form into another. 



We argue that measurement, described by the rela- 
tion between magnitude and information, unifies the dis- 
tinct families of common probability distributions. Vari- 
ations in measurement scale may, for example, arise from 
varying precision in observations at different magnitudes 
or from the way that information is lost when measure- 
ments are made on aggregates. Our unified explanation 
of the different commonly observed distributions in terms 
of measurement points the way to a deeper understand- 
ing of the relations between pattern and process. 

We develop the role of measurement through maxi- 
mum entropy expressions for probability distributions. 
We first note that all probability distributions can be
expressed by maximization of entropy subject to constraint.
Maximization of entropy is equivalent to minimizing total
information while retaining all the particular information
known to constrain underlying pattern. To
obtain a probability distribution of a given form, one
simply chooses the informational constraints such that 
maximization of entropy yields the desired distribution. 
However, constraints chosen to match a particular distri- 
bution only describe the sufficient information for that 
distribution. To obtain deeper insight into the causes of 
particular distributions and each distribution's position 
among related families of distributions, we derive the re- 
lated forms of constraints through variations in measure- 
ment scale. 

Measurement scale expresses information through the 
invariant transformations of measurements that leave the 
form of the associated probability distribution unchanged 
[1]. Each problem has a characteristic form of information
invariance and symmetry that sets the measurement
scale [1], and the most likely probability distribution
associated with that particular scale. We show
that measurement scales and the symmetries of informa- 
tion invariances form a natural hierarchy that generates 
the common families of probability distributions. We use 
invariance and symmetry interchangeably, in the sense 
that symmetry arises when an invariant transformation 
leaves an object unchanged 25 1. 



The measurement hierarchy arises from two processes. 
First, we express the forms of information invariance and 
measurement scale through a continuous group of trans- 
formations, showing the relations between different types 
of information invariance. Second, the types of aggre- 
gation and measurement that minimize information and 
maximize entropy fall into two classes, each class setting 
a different basis for information invariance and measure- 
ment scale. 

The two types of aggregation correspond to the two 
major families of stable distributions that generalize the 
process leading to the central limit theorem: the Lévy
family that includes the Gaussian distribution as a spe- 
cial case, and the Fisher-Tippett family of extreme value 
distributions. By expressing measurement scale in a gen- 
eral way, we obtain a wider interpretation of the families 
of stable distributions and a broader classification of the 
common distributions. 

Our derivation of probability distributions and their
familial relations supersedes the Pearson and similar
classifications of continuous distributions. Our system
derives from a natural description of varying information
in measurements under different conditions, whereas
the Pearson and related systems derive from phenomenological
descriptions that generate distributions without
clear grounding in fundamental principles such as
measurement and information.

Some recent systems of probability distributions, such
as the unification by Morris [13], provide great insight
into the relations between families of distributions.
However, Morris's system and other common classifica- 
tions do not derive from what we regard as fundamental 
principles, instead arising from descriptions of structural 
similarities among distributions. We provide a detailed 
analysis of Morris's system in relation to ours in Ap- 
pendix C. 

We favor our system because it derives the relations be- 
tween distributions from fundamental principles, such as 
maximum entropy and the invariances that define mea- 
surement scale. Although the notion of what is fun- 
damental will certainly attract controversy, our favored 
principles of entropy, symmetries defined by invariances, 
and measurement scale certainly deserve consideration. 
Our purpose is to show what one can accomplish by start- 
ing solely with these principles. 



II. MAXIMUM ENTROPY AND 
MEASUREMENT SCALE 

This section reviews our prior work on the roles of in- 
formation invariance and measurement scale in setting 
observed pattern. The following sections extend this
prior work by expressing measurement in terms of the 
scale of aggregation and the continuous group transfor- 
mations of information invariance. 



A. Maximum entropy 

The method of maximum entropy defines the most 
likely probability distribution as the distribution that 
maximizes a measure of entropy (randomness) subject to 
various information constraints [9]. We write the quantity
to be maximized as



$$\mathcal{L} = \mathcal{E} - \kappa C_0 - \sum_i \lambda_i C_i \tag{1}$$

where $\mathcal{E}$ measures entropy, the $C_i$ are the constraints to
be satisfied, and $\kappa$ and the $\lambda_i$ are the Lagrange multipliers
to be found by satisfying the constraints. Let
$C_0 = \int p_y\,dy - 1$ be the constraint that the probabilities
must total one, where $p_y$ is the probability distribution
function of $y$. The other constraints are usually
written as $C_i = \int p_y f_i(y)\,dy - \bar{f}_i$, where the $f_i(y)$ are
various transformed measurements of $y$, and the overbar
denotes mean value. A mean value is either the average
of some function applied to each of a sample of observed
values, or an a priori assumption about the average value
of some function with respect to a candidate set of probability
laws. If $f_i(y) = y^i$, then the $\bar{f}_i$ are the moments
of the distribution, either the moments estimated from
observations or a priori values of the moments set by
assumption. The moments are often regarded as "standard"
constraints, although from a mathematical point
of view, any properly formed constraint can be used.

Here, we confine ourselves to a single constraint of measurement.
We express that constraint with a more general
notation, $C_1 = \int p_y T(f_y)\,dy - \bar{T}_f$, where $f_y = f(y)$,
and $T(f_y) = T_f$ is a transformation of $f_y$. We could, of
course, express the constraining function for $y$ directly
through $f_y$. However, we wish to distinguish between an
initial function $f_y$ that can be regarded as a standard
measurement, in any sense in which one chooses to interpret
the meaning of standard, and a transformation of
standard measurements denoted by $T_f$ that arises from
information about the measurement scale.

The maximum entropy distribution is obtained by solv- 
ing the set of equations 



$$\frac{\partial \mathcal{L}}{\partial p_y} = \frac{\partial \mathcal{E}}{\partial p_y} - \kappa - \lambda T_f = 0 \tag{2}$$

where one checks the candidate solution for a maximum
and obtains $\kappa$ and $\lambda$ by satisfying the constraint on total
probability and the constraint on $\bar{T}_f$. We assume that
we can treat the entropy measures and the maximization
procedure by the continuous limit of the discrete case.

In the standard approach, we define entropy by extension
of Shannon information

$$\mathcal{E} = -\int p_y \log\left(\frac{p_y}{m_y}\right) dy, \tag{3}$$

in which this expression may be called Jaynes's differential
entropy, which is equivalent in form to the continuous
expression of relative entropy or the Kullback-Leibler
divergence. Here, we will interpret $m_y$ by information
invariance and measurement scale as discussed
below. With these definitions, the solution of Eq. (2) is

$$p_y \propto m_y e^{-\lambda T_f}, \tag{4}$$

where $\lambda$ satisfies the constraint $C_1$, and the proportionality
is adjusted so that the total probability is one by
choosing the parameter $\kappa$ to satisfy the constraint $C_0$.
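As a rough numerical illustration of the maximum entropy solution in Eq. (4), the following sketch builds a density of the form $m_y e^{-\lambda T_f}$ on a grid and finds the multiplier that meets a target mean of $T$. The function name `maxent_density`, the grid, and the bisection bounds are assumptions chosen for this example only. With $T(y) = y$, $m_y = 1$, and a target mean of 2, the multiplier should come out close to 0.5, the rate of an exponential distribution with mean 2.

```python
import math

def maxent_density(ys, T, m, target_mean, lo=1e-6, hi=50.0, iters=60):
    """Return (lam, p) with p_i proportional to m(y_i) * exp(-lam * T(y_i)).

    lam is found by bisection so that the mean of T under p equals target_mean.
    Assumes the mean of T decreases monotonically in lam on [lo, hi].
    """
    def normalized(lam):
        weights = [m(y) * math.exp(-lam * T(y)) for y in ys]
        total = sum(weights)
        return [w / total for w in weights]

    def mean_T(lam):
        p = normalized(lam)
        return sum(pi * T(y) for pi, y in zip(p, ys))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_T(mid) > target_mean:
            lo = mid  # mean too large, so a larger multiplier is needed
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, normalized(lam)

# Midpoints of a uniform grid on (0, 60); the upper cutoff is an assumption.
ys = [0.005 + 0.01 * i for i in range(6000)]
lam, p = maxent_density(ys, T=lambda y: y, m=lambda y: 1.0, target_mean=2.0)
```

The normalization step plays the role of $\kappa$, and the bisection plays the role of choosing $\lambda$ to satisfy the constraint on the mean of $T$.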



B. Information invariance and measurement scale 

Maximum entropy must capture all of the available
information about a particular problem. One form of
information concerns transformations to the measurement
scale that leave the most likely probability distribution
unchanged [1]. Here, it is important to distinguish
between measurements and measurement scale. In our
notation, we start with measurements, $f_y$, made on the
measurement scale $y$. For example, we may have measures
of squared deviations about zero, $f_y = y^2$, with
respect to the measurement scale $y$, such that $\bar{f}_y$ is the
second moment of the measurements with respect to the
underlying measurement scale.

Suppose that we obtain the same information about
the underlying probability distribution from measurements
of $f_y$ or transformed measurements, $G(f_y)$. Put
another way, if one has access only to measurements
$G(f_y)$, one has the same information that would be obtained
if the measurements were reported as $f_y$. We
say that the measurements $f_y$ and $G(f_y)$ are equivalent
with respect to information, or that the transformation
$f_y \to G(f_y)$ is an information invariance that describes a
symmetry of the measurement scale.

To capture this information invariance in maximum
entropy, we must express our measurements so that

$$T(f_y) = \delta + \phi\, T[G(f_y)] \tag{5}$$

for some arbitrary constants $\delta$ and $\phi$. Putting this
definition of $T(f_y) = T_f$ into Eq. (4) shows that we get
the same maximum entropy solution whether we use the
observations $f_y$ or the transformed observations, $G(f_y)$,
because the $\kappa$ and $\lambda$ constants will adjust to the constants
$\delta$ and $\phi$ so that the distribution remains unchanged.
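The invariance condition in Eq. (5) can be checked numerically for one concrete case. The sketch below assumes a logarithmic scale $T(y) = \log y$ and the power transformation $G(y) = y^c$, for which $\delta = 0$ and $\phi = 1/c$; the helper name `density`, the grid, and the parameter values are illustrative assumptions. It confirms that using $T[G(y)]$ as the constraint gives the same normalized distribution once the multiplier is rescaled by $\phi$.

```python
import math

def density(ys, T, lam):
    """Normalized p_i proportional to exp(-lam * T(y_i)) on the grid ys (m_y = 1)."""
    w = [math.exp(-lam * T(y)) for y in ys]
    s = sum(w)
    return [x / s for x in w]

ys = [1.0 + 0.01 * i for i in range(2000)]  # grid on [1, 21), an arbitrary choice
c, lam = 3.0, 1.5

T = math.log                 # logarithmic measurement scale
G = lambda y: y ** c         # candidate invariance, G(y) = y^c

# Eq. (5) with delta = 0 and phi = 1/c: T(y) = delta + phi * T[G(y)]
delta, phi = 0.0, 1.0 / c
affine_gap = max(abs(T(y) - (delta + phi * T(G(y)))) for y in ys)

# Constraining on T[G(y)] only rescales the multiplier: lam -> lam * phi
p_direct = density(ys, T, lam)
p_transformed = density(ys, lambda y: T(G(y)), lam * phi)
max_diff = max(abs(a - b) for a, b in zip(p_direct, p_transformed))
```

A nonzero $\delta$ would shift the exponent by a constant, which the normalization absorbs in the same way.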



III. DERIVING PROBABILITY 
DISTRIBUTIONS 

The prior section established two key steps. First,
maximum entropy probability distributions have the
form given in Eq. (4) as $p_y \propto m_y e^{-\lambda T_f}$. Second, the
expression of $T(f_y)$ for each problem comes from the
particular information invariance $G(f_y)$ associated with
that particular problem. To derive specific probability
distributions, we must pass three further steps, which we
develop in the following sections.

First, we turn the abstract notions of information
invariance and measurement scale into specific expressions
for the measurement scale function, $T(f_y)$. We accomplish
this by developing the continuous group transformations
for information invariance. Those continuous
transformations provide an abstract hierarchy of forms
for probability distributions based on the scale factor,
$m_y$, the specific measured attribute, $f_y$, and how the
information and precision of measurements change with
magnitude expressed by the measurement scale $T(f_y)$.

Second, we define $m_y$ as the relation between the scale
of information invariance and the scale on which we ex- 
press probability. To use the maximization of entropy 
and the associated minimization of information, we must 
relate the information invariance of measurement to the 
scale on which underlying processes dissipate informa- 
tion. We consider alternative interpretations of scale that 
may be associated with the dissipation of information by 
aggregation of random perturbations and by measure- 
ments of extreme values. We also consider measurements 
on a scale that differs from the basis for dissipation of in- 
formation. 

Third, we consider how to interpret $f_y$, which is the
value used to describe the informational constraint in
relation to the measurement scale $T(f_y)$, leading to the
constraint $\bar{T}_f$. We discuss $f_y$ as a reduction in the
dimensionality of information to a single sufficient dimension.
That sufficient dimension sets the form of probability
under the various processes of information dissipation that
lead to the common probability distributions.



IV. CONTINUOUS GROUP 
TRANSFORMATIONS OF MEASUREMENT 

The transformation in Eq. (5) sets the relation between
information invariance and measurement scale. However, 
that expression does not show in a simple way the rela- 
tions between information and measurement. 

To understand commonly observed patterns in relation 
to the families of probability distributions, it is help- 
ful to express in a general way the underlying symme- 
try that determines information invariance and measure- 
ment scale. From that underlying symmetry, we may see 
more clearly the associated relations between the forms 
of probability distributions. 
A. The affine structure

The relation between information invariance and
measurement scale in Eq. (5) arises directly from the form
of maximum entropy solutions in Eq. (4), in which probability
distributions are exponentials of the transformed
constraint measures, $T_f$. In particular, the probability
distribution associated with a constraint is invariant to
an additive shift of the constraint and a multiplicative
change in the scale of the constraint, given by the parameters
$\delta$ and $\phi$ in Eq. (5). It is that symmetry in the affine
structure of invariant transformation that ultimately sets
the underlying relations between information, measurement,
and the familial forms of the common probability
distributions.

To understand the affine structure of the invariant
transformation in Eq. (5) more clearly, we can express
that invariant transformation as a continuous operator.
First, rearrange Eq. (5) as an equivalent expression

$$T[G(f_y)] = a + b\,T(f_y) \tag{6}$$

with new parameters $a$ and $b$ that are easily calculated
from Eq. (5). We show in Appendix A that we can express
the same information invariance of $G(f_y)$ by a
differential operator, $v_w$ (Eq. 7), that can be applied to $T$ as

$$v_w(T) = \alpha + \beta T. \tag{8}$$

Recursive application of $v_w$ preserves the affine structure
and so keeps the successive transformations within the
group of admissible invariance relations.
We can express $v_w$ as

$$v_w = \frac{d}{dw}, \tag{9}$$

where $w = w(f_y)$ is some function of $f_y$. We then have a
differential equation for $T$ as

$$\frac{dT}{dw} - \beta T = \alpha, \tag{10}$$

which has solutions of the general form

$$T(f_y) = T_0 e^{\beta w} + \frac{\alpha}{\beta}\left(e^{\beta w} - 1\right), \tag{11}$$

which as $\beta$ goes to zero becomes $T(f_y) \to T_0 + \alpha w$. Eq. (11) gives
the most general class of measurement functions, $T(f_y)$,
for which the associated transformations generated by $v_w$
preserve information invariance.

The operator $v_w$ can be applied repeatedly, creating a
recursively generated sequence of deformations that all
satisfy the fundamental relation between deformations
of measurement and information invariance. By thinking
of $w(f_y)$ as a parameter that expresses the deformation
of measurement associated with a measurement
scale, $T(f_y)$, we can create a sequence in which each successive
deformation corresponds to a successive class of
probability distributions with familial relations to each
other defined by the structure of the sequence of deformations
to $w(f_y)$.



B. The general form of probability distributions

From Eq. (4), the maximum entropy solution is

$$p_y \propto m_y e^{-\lambda T_f}. \tag{12}$$

From Eq. (11), we can now express the maximum entropy
solution as

$$p_y \propto m_y e^{-\Lambda e^{\beta w}}, \tag{13}$$

where $\Lambda = \lambda(T_0 + \alpha/\beta)$, and $w = w(f_y)$. In the limit
$\beta \to 0$, this becomes

$$p_y \propto m_y e^{-\gamma w},$$

where $\gamma = \lambda\alpha$.

In Appendix B we describe the case of extreme values,
for which we will use $m_y = dT(f_y)/dy$. When $f_y = y$
and $m_y = dT(y)/dy = T'$, it will be convenient to write

$$T' \propto w' e^{\beta w}, \tag{14}$$

where $w' = dw(y)/dy$, and as $\beta \to 0$, $T' \propto w'$.
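The relation between the differential equation in Eq. (10) and its solution in Eq. (11) can be checked numerically. The sketch below assumes the solution takes the form $T = T_0 e^{\beta w} + (\alpha/\beta)(e^{\beta w} - 1)$, which satisfies $dT/dw = \alpha + \beta T$ and reduces to $T_0 + \alpha w$ as $\beta \to 0$. The function names and parameter values are illustrative assumptions.

```python
import math

def T_scale(w, T0, alpha, beta):
    """Measurement scale of Eq. (11); uses the beta -> 0 limit for tiny beta."""
    if abs(beta) < 1e-12:
        return T0 + alpha * w
    e = math.exp(beta * w)
    return T0 * e + (alpha / beta) * (e - 1.0)

def ode_residual(w, T0, alpha, beta, h=1e-5):
    """Central-difference value of dT/dw - beta*T - alpha, which should be near 0."""
    dT = (T_scale(w + h, T0, alpha, beta) - T_scale(w - h, T0, alpha, beta)) / (2 * h)
    return dT - beta * T_scale(w, T0, alpha, beta) - alpha

T0, alpha = 0.7, 1.3
residuals = [abs(ode_residual(w, T0, alpha, beta=0.8)) for w in (0.0, 0.5, 1.0, 2.0)]

# For small beta the scale approaches the linear form T0 + alpha * w.
limit_gap = abs(T_scale(1.5, T0, alpha, beta=1e-6) - (T0 + alpha * 1.5))
```

The residuals stay at the level of finite-difference error, and the small-$\beta$ value differs from the linear form only by a term of order $\beta$.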



V. INTUITIVE DESCRIPTION OF 
MEASUREMENT AND PROBABILITY 

Intuitively, one can think of the symmetry of informa- 
tion invariance and measurement scale in the following 
way. On a linear scale, each incremental change of fixed 
length yields the same amount of information or surprise 
independently of magnitude. Thus, if we change the scale 
by multiplying all magnitudes by a constant, we obtain 
the same pattern of information relative to magnitude. 
In other words, the linear scale is invariant to multiplica- 
tion by a constant factor so that, within the framework of 
maximum entropy subject to constraint, we get the same 
information about probability distributions from an observation
$y$ or $G(y) = cy$. In this section, we use $f_y = y$
to isolate the symmetry expressed by particular choices
of $T$ and $G$.

On a logarithmic scale, each incremental change in pro- 
portion to the current magnitude yields the same amount 
of information or surprise. Information is scale depen- 
dent. We obtain the same information at any point on 
the scale by comparing ratios. For example, we gain the
same information from the increment $dy/y = d\log(y)$
independently of the magnitude of $y$. Thus, we achieve
information invariance with respect to ratios by measuring
increments on a logarithmic scale. Within the framework
of maximum entropy subject to constraint, we get the
same information about probability distributions from
an observation $y$ or $G(y) = y^c$, corresponding to
informationally equivalent measurements $T(y) = \log(y)$ and
$T(y^c) = c\log(y)$ (see ref. [4]).

The form of a probability distribution under maximum
entropy can be read directly as an expression of how the
measurement scale changes with magnitude. From the
general solution in Eq. (4), linear scales $T(y) \propto y$ yield
distributions that are exponential in $y$, whereas logarithmic
scales $T(y) \propto c\log(y)$ yield distributions that are
powers of $y$. Exponential distributions of the form $e^{-\lambda y}$
arise from underlying linear scales, whereas power law
distributions of the form $y^{-\gamma}$ arise from underlying
logarithmic scales.

Many common distributions have compound form, in
which one can read directly how the underlying measurement
scale changes with magnitude. For example,
the gamma distribution has form $y^{-\gamma} e^{-\lambda y}$. When the
magnitude of $y$ is small, the shape of the distribution is
dominated by the power law component, $y^{-\gamma}$. As the
magnitude of $y$ increases, the shape of the distribution is
dominated by the exponential component, $e^{-\lambda y}$. Thus,
the underlying measurement scale grades from logarithmic
at small magnitudes to linear at large magnitudes.
Indeed, the gamma distribution is exactly the expression
of an underlying measurement scale that grades from
logarithmic to linear as magnitude increases. Nearly every
common probability distribution can be read directly as
a simple expression of the change in the underlying
measurement scale with magnitude.
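One way to see the change from logarithmic to linear scaling in the gamma form is to compute the local slope of $\log p_y$ against $\log y$. The sketch below assumes the unnormalized form $y^{-\gamma} e^{-\lambda y}$, for which the slope is $-\gamma - \lambda y$: nearly constant (power law) when $y$ is small and growing in proportion to $y$ (exponential) when $y$ is large. The function names and parameter values are illustrative assumptions.

```python
import math

def log_gamma_form(y, gamma, lam):
    """log of the unnormalized gamma form y^(-gamma) * exp(-lam * y)."""
    return -gamma * math.log(y) - lam * y

def local_log_slope(y, gamma, lam, rel_step=1e-4):
    """Numerical d(log p)/d(log y); analytically equal to -gamma - lam * y."""
    up, down = y * (1 + rel_step), y * (1 - rel_step)
    return (log_gamma_form(up, gamma, lam) - log_gamma_form(down, gamma, lam)) / (
        math.log(up) - math.log(down))

gamma, lam = 0.5, 2.0
slope_small = local_log_slope(1e-4, gamma, lam)  # power-law regime, near -gamma
slope_large = local_log_slope(50.0, gamma, lam)  # exponential regime, near -lam * y
```

The crossover between the two regimes occurs near $y \approx \gamma/\lambda$, where the two terms of the slope are equal.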



VI. HIERARCHIES OF COMMON 
PROBABILITY DISTRIBUTIONS 

Given a particular form for the function $w(f_y)$, the
measurement scale $T(f_y)$ follows from Eq. (11) and the
associated probability distribution follows from Eq. (13).
Although we can choose $w$ in any way that we wish, certain
measurement scales and information invariances are
likely to be common. We discussed in our earlier paper
the importance of two scales [4]. The first scale grades
from linear to logarithmic as magnitude increases, which
we call the linear-log scale. The second scale inverts the
linear-log scale to be logarithmic at small magnitudes and
linear at large magnitudes, giving the log-linear scale.
The inversion relating the two scales can be expressed by
a Laplace transform, showing the natural duality of the
scales and a connection to recent studies on superstatistics.


A. The linear-log scale 

In terms of the notation in the present paper, we can
define $w$ to establish a hierarchy of measurement deformations,
in which each level in the hierarchy arises from
successive application of the linear-log scaling to the scale
in the previous level in the hierarchy.

To define the linear-log measurement function in terms
of $w$, note from Eq. (11) that, as $\beta \to 0$, the forms of $w$
and the measurement function $T$ become equivalent with
respect to setting the associated probability distribution.
Thus, by setting $w$, we are defining the limiting form of
the measurement function. With these issues in mind,
define

$$w^{(i)} = \log\left(c_i + w^{(i-1)}\right),$$

with $w^{(0)} = f_y$. The constant $c_i$ sets the transition between
linear and logarithmic scaling: the scale is linear
when $w^{(i-1)}$ is small relative to $c_i$ and logarithmic when
$w^{(i-1)}$ is large relative to $c_i$. As $c_i \to 0$, we can use

$$w^{(i)} = \log w^{(i-1)}.$$

It is easiest to see the abstract structure of the measurement
hierarchy and the associated forms of probability
distributions in the limiting case $c_i = 0$, leading to
purely logarithmic deformations. The first row of Table I
begins with the base measurement $w^{(0)} = f_y$. The following
two rows show the first two deformations for the
sequence $i = 0, 1, 2$.





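| $w(f_y)$ | $p_y$ | $p_y$, $\beta \to 0$ |
|---|---|---|
| $f_y$ | $m_y e^{-\Lambda e^{\beta f_y}}$ | $m_y e^{-\gamma f_y}$ |
| $\log f_y$ | $m_y e^{-\Lambda f_y^{\beta}}$ | $m_y f_y^{-\gamma}$ |
| $\log\log f_y$ | $m_y e^{-\Lambda (\log f_y)^{\beta}}$ | $m_y (\log f_y)^{-\gamma}$ |

TABLE I. The logarithmic measurement hierarchy and the
associated form of the probability distribution function $p_y$
from Eq. (13). Note that $\beta \to 0$ of each line corresponds to
$\beta = 1$ of the following line.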

This table gives the hierarchy of probability distributions
that arise from successive logarithmic deformations.
With this structure in mind, we give the full expansion
with $c_i \neq 0$ in Table II.
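The recursive construction of the linear-log hierarchy can be sketched in a few lines. The code below assumes the recursion $w^{(i)} = \log(c_i + w^{(i-1)})$ with $w^{(0)} = f_y$ and the general form $p_y \propto m_y e^{-\Lambda e^{\beta w}}$ from Eq. (13), with $m_y = 1$; the function names and numerical values are illustrative assumptions. It checks that the first two purely logarithmic levels reproduce the stretched exponential forms in $f_y$ and in $\log f_y$, and that a level with $c_1 > 0$ is close to linear when $f_y$ is small relative to $c_1$.

```python
import math

def linear_log_w(f, cs):
    """Apply w(i) = log(c_i + w(i-1)) once per entry of cs, starting from w(0) = f."""
    w = f
    for c in cs:
        w = math.log(c + w)
    return w

def log_density_form(w, Lam, beta):
    """log of the unnormalized factor exp(-Lam * exp(beta * w)), with m_y = 1."""
    return -Lam * math.exp(beta * w)

f, Lam, beta = 7.0, 0.3, 1.7

# Level 1 with c_1 = 0: w = log f, giving exp(-Lam * f^beta).
level1 = log_density_form(linear_log_w(f, [0.0]), Lam, beta)
stretched = -Lam * f ** beta

# Level 2 with c_1 = c_2 = 0: w = log log f, giving exp(-Lam * (log f)^beta).
level2 = log_density_form(linear_log_w(f, [0.0, 0.0]), Lam, beta)
log_stretched = -Lam * math.log(f) ** beta

# With c_1 > 0 the scale is nearly linear for f much smaller than c_1.
c1 = 100.0
small_f_gap = abs(linear_log_w(0.5, [c1]) - (math.log(c1) + 0.5 / c1))
```

The purely logarithmic levels require $f_y > 1$ at the second level so that $\log\log f_y$ is defined; the value $f = 7$ is chosen with that in mind.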

We discuss the interpretation of $m_y$ and $f_y$ below. The
different interpretations of those values lead directly to
specific forms for probability distributions. Before interpreting
$m_y$ and $f_y$, we present an alternative measurement
scale.



B. The log-linear scale 

We obtain the log-linear measurement deformation hierarchy
[4] from

$$w^{(i)} = c_i w^{(i-1)} + \log\left(w^{(i-1)}\right),$$

from which we obtain the probability distributions in Table III.
The log-linear scale changes logarithmically at
small magnitudes and linearly at large magnitudes.
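A minimal sketch of the first log-linear level shows how the gamma form arises. It assumes the recursion $w^{(i)} = c_i w^{(i-1)} + \log w^{(i-1)}$ with $w^{(0)} = f_y$ and the limiting form $p_y \propto m_y e^{-\gamma w}$ with $m_y = 1$; the function names and parameter values are illustrative assumptions. Only the first level is evaluated, because deeper levels require $w^{(i-1)} > 0$ for the logarithm to be defined.

```python
import math

def log_linear_w(f, cs):
    """Apply w(i) = c_i * w(i-1) + log(w(i-1)) once per entry of cs, from w(0) = f."""
    w = f
    for c in cs:
        w = c * w + math.log(w)
    return w

def log_p_limit(f, cs, gamma):
    """log of the unnormalized beta -> 0 form exp(-gamma * w), with m_y = 1."""
    return -gamma * log_linear_w(f, cs)

gamma, c1 = 0.4, 2.5
checks = []
for f in (0.01, 1.0, 30.0):
    # log of the gamma form f^(-gamma) * exp(-gamma * c1 * f)
    gamma_form = -gamma * math.log(f) - gamma * c1 * f
    checks.append(abs(log_p_limit(f, [c1], gamma) - gamma_form))
```

The logarithmic term dominates $w$ at small $f_y$ and the linear term dominates at large $f_y$, matching the description of the log-linear scale.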
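| $w(f_y)$ | $p_y$ | $p_y$, $\beta \to 0$ |
|---|---|---|
| $f_y$ | $m_y e^{-\Lambda e^{\beta f_y}}$ | $m_y e^{-\gamma f_y}$ |
| $\log(c_1 + f_y)$ | $m_y e^{-\Lambda (c_1 + f_y)^{\beta}}$ | $m_y (c_1 + f_y)^{-\gamma}$ |
| $\log(c_2 + \log(c_1 + f_y))$ | $m_y e^{-\Lambda (c_2 + \log(c_1 + f_y))^{\beta}}$ | $m_y (c_2 + \log(c_1 + f_y))^{-\gamma}$ |

TABLE II. The linear-log measurement hierarchy.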





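| $w(f_y)$ | $p_y$ | $p_y$, $\beta \to 0$ |
|---|---|---|
| $f_y$ | $m_y e^{-\Lambda e^{\beta f_y}}$ | $m_y e^{-\gamma f_y}$ |
| $c_1 f_y + \log f_y$ | $m_y e^{-\Lambda f_y^{\beta} e^{\beta c_1 f_y}}$ | $m_y f_y^{-\gamma} e^{-\gamma c_1 f_y}$ |
| $c_2 (c_1 f_y + \log f_y) + \log(c_1 f_y + \log f_y)$ | $m_y e^{-\Lambda e^{\beta w}}$ | $m_y e^{-\gamma w}$ |

TABLE III. The log-linear measurement hierarchy. In the last line of the table, we use $w = w(f_y)$ to shorten the expression.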



C. Other scales 

The linear-log and log-linear scales describe common
forms of measurement functions. In this section, we
briefly mention some other scales listed in Table IV.
These additional scales illustrate the ways in which
measurement relates to the patterns of probability.

The first line of Table IV shows a log-linear-log scale
for a measure on the interval $(c_1, c_2)$. That scale changes
logarithmically near the boundaries and linearly near the
middle of the range, in which log b describes the skew in
the scaling pattern.

The second line of Table IV shows a linear-log-linear
scale for $f_y > 0$. That scale changes linearly near the
lower boundary of zero, linearly at large magnitudes, and
logarithmically at intermediate values.



VII. THE SCALE OF INFORMATION 

The prior section presented probability distributions in
terms of $m_y$ and $f_y$. This section develops the interpretation
of $m_y$, which arises from the relation between the
scale of information invariance and the scale on which we
express probability.

The key issue is that maximum entropy requires some 
underlying process to dissipate information. With re- 
gard to deriving probability distributions, we may con- 
sider three aspects of scale in relation to the dissipation 
of information. First, we may measure an outcome that 
arises from the aggregation of a series of random pertur- 
bations. Second, we may measure only the extreme val- 
ues of some underlying process, thereby throwing away 
all information about the underlying process except the 
form of the upper or lower tail of the underlying distribu- 
tion. Third, the dissipation of information may occur on 
one scale, but we may wish to make our measurements 
with respect to another scale. 

Each of these three aspects of the scale of information
dissipation leads to a simple interpretation of probability
measure in maximum entropy analysis. We give a
brief description of each scale of information dissipation in
relation to calculating $m_y$.



A. Aggregation of perturbations 

In the standard application of maximum entropy, ac- 
cumulation of random perturbations without constraint 
leads to a uniform probability measure, which has maxi- 
mum entropy and minimum information. Thus, the scale 
at which information dissipates is the same as the scale of 
the probability measure. In this case, our formulation of
maximum entropy has $m_y = 1$, because any information
that arises from deformation of measurement relative to
the uniform default is included in our expression of
measurement scale, $T(f_y)$.



B. Extreme values 

The distribution of extreme values depends only on
the total (integral) of the probability measure in the tail
of an underlying probability distribution. Because
extreme value distributions arise from integrals of probability
measures, the dissipation of information and the
associated measurement scale for extreme values is expressed
in terms of the cumulative distribution function
(see Appendix B). To obtain the associated form of the
probability measure with respect to the probability distribution
function, $p_y$, we must transform the invariant
measurement scale originally expressed with respect to
the integral of the underlying probability measure.

To change from the integral scale of the cumulative
distribution to the scale of the probability measure associated
with the probability density function, we simply
differentiate the initial measurement scale, $T(f_y)$, from
the cumulative distribution scale to obtain the associated
change in probability measure (Appendix B). For
$f_y = y$, we obtain $m_y = dT(y)/dy = T'$. We gave the
general form of $m_y = T'$ in Eq. (14).

| Line | $w(f_y)$ | $p_y$ |
|---|---|---|
| 1 | | |
| 2 | $c_2 f_y + b\log(c_1 + f_y)$ | $m_y (c_1 + f_y)^{-\gamma b} e^{-\gamma c_2 f_y}$ |

TABLE IV. As $\beta \to 0$, line 1 is a log-linear-log measurement scale, and line 2 is a linear-log-linear measurement scale.
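The extreme value rule $m_y = T'$ can be illustrated with the simplest case. The sketch below assumes $f_y = y$ and $w = y$, so that Eq. (14) gives $T' \propto e^{\beta y}$ and Eq. (13) gives the Gumbel-type form $p_y \propto e^{\beta y - \Lambda e^{\beta y}}$. Under the substitution $u = e^{\beta y}$ this unnormalized form integrates to $1/(\beta\Lambda)$, which the code checks by numerical integration. The function names, integration limits, and parameter values are illustrative assumptions.

```python
import math

def gumbel_form(y, Lam, beta):
    """Unnormalized p_y = T' * exp(-Lam * e^(beta*y)) with T' taken as e^(beta*y)."""
    return math.exp(beta * y - Lam * math.exp(beta * y))

def integrate(func, a, b, n=20000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

Lam, beta = 0.8, 1.3
# The integrand is negligible outside this range for these parameter values.
total = integrate(lambda y: gumbel_form(y, Lam, beta), -30.0, 10.0)
expected = 1.0 / (beta * Lam)
```

Multiplying the form by $\beta\Lambda$ therefore gives a proper density, with cumulative distribution $1 - e^{-\Lambda e^{\beta y}}$, which shows the direct link between the measurement scale and the cumulative distribution in this case.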



C. Change of variable 

In some cases, information may dissipate on one scale,
but we choose to express probability on another scale.
The log-normal distribution is the classic example. Using
Table I, we may consider measurements that lead to
the Normal or Gaussian distribution by either analyzing
squared deviations from a central value, $f_y = (y - \mu)^2$,
in line one of Table I with $\beta \to 0$ or, equivalently, linear
perturbations of $f_y = (y - \mu)$ in line two of Table I with
$\beta = 2$. In these cases, the perturbations are direct measures
rather than the tail probabilities of extreme values,
so $m_y = 1$, and we have the standard form of the Gaussian
as $p_y \propto e^{-\gamma (y - \mu)^2}$.

If we prefer to analyze values on a logarithmic scale,
then we make the transformation $y \to \log y$. This case
does not arise from invariant information and the associated
measurement transformation, but rather from a
change of variable to a different scale. So we must change
our measure, as in any standard change of variable. In
this case, the change of measure is $m_y\,dy = d\log y =
dy/y$, thus $m_y = y^{-1}$ and we obtain the log-normal distribution
$p_y \propto y^{-1} e^{-\gamma (\log y - \mu)^2}$, where $\gamma$ and $\mu$ are
transformed appropriately.
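The change of variable can be checked against the standard log-normal density. The sketch below assumes the form $p_y \propto y^{-1} e^{-\gamma(\log y - \mu)^2}$ with $\gamma = 1/(2\sigma^2)$ and compares it with the textbook log-normal density at several points; the ratio should be the same constant, $\sigma\sqrt{2\pi}$, everywhere. The function names and parameter values are illustrative assumptions.

```python
import math

def log_normal_form(y, gamma, mu):
    """Unnormalized p_y = y^(-1) * exp(-gamma * (log y - mu)^2), i.e. m_y = 1/y."""
    return math.exp(-gamma * (math.log(y) - mu) ** 2) / y

def log_normal_pdf(y, mu, sigma):
    """Standard log-normal density, used only as a reference."""
    z = (math.log(y) - mu) / sigma
    return math.exp(-0.5 * z * z) / (y * sigma * math.sqrt(2.0 * math.pi))

mu, sigma = 0.3, 0.9
gamma = 1.0 / (2.0 * sigma ** 2)
ratios = [log_normal_form(y, gamma, mu) / log_normal_pdf(y, mu, sigma)
          for y in (0.05, 0.5, 1.0, 4.0, 25.0)]
spread = max(ratios) - min(ratios)
```

Dropping the factor $y^{-1}$ would give a Gaussian in $\log y$ that is not a density with respect to $dy$, which is why the change of measure is needed.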



VIII. SUFFICIENCY: REDUCTION OF 
INFORMATION 

The algorithm of maximum entropy allows us to choose
any constraint $T(f_y)$. However, one of our main goals is
to provide a clear rationale for the choice of constraint,
so that maximum entropy is more than a tautological
description of probability distributions. We have expressed
the choice of the measurement scale, $T$, in terms of
information invariance set by the underlying problem. Although
information invariance may take various forms,
we followed our earlier paper [4] in which we defended
the linear-log and log-linear scales as likely to be common
scales associated with common information invariances.

Once we have set the transformation $T(f_y)$ by these
common information invariances, many widely observed
probability distributions follow. In some cases, deriving
probability distributions requires using an observable,
$f_y \neq y$, that differs from the scale $y$ of the underlying
probability measure. For example, we may use
the squared deviations from a central location, or a fractional
moment $f_y = y^{\alpha}$, where $\alpha$ is not an integer.
Use of $f_y = y$ or of squared deviations $f_y = (y - \mu)^2$ is
widely accepted. Such choices lead to $f_y$ being a sufficient
reduction of all of the information in observations
in order to express common probability distributions.

For our purposes in this paper, we simply note that
we can derive many common distributions by the widely
accepted use of $f_y = y$ or $f_y$ as a squared deviation. But
the reasons that particular choices of $f_y$ are sufficient
have not been fully explained with regard to maximum
entropy, particularly fractional moments such as $f_y = y^{\alpha}$.
Those reasons probably have to do with the sort
of analysis described by large deviation theory, in
which the retained information arises from the minimal
descriptions of location and scale that remain when one
normalizes the consequences of a sequence of perturbations
so that one obtains a stable limiting form.



IX. CONCLUSIONS 

Table V shows many of the commonly observed probability
distributions. Those distributions arise directly
from maximum entropy applied to various natural mea- 
surement scales. The measurement scales express infor- 
mation invariances associated with particular types of 
problems and the scale on which information dissipation 
occurs. We confined ourselves to various combinations 
of linear and logarithmic scaling, which were sufficient to 
express many common distributions. Our method read- 
ily extends to other types of information invariance and 
measurement scale and their associated probability distributions.

| Distribution | $p_y$ | T.L.C | $m_y$ | $f_y$ | Notes and alternative names |
|---|---|---|---|---|---|
| Gumbel | | II.1.2 | $T'$ | $y$ | |
| Gibbs/Exponential | | II.1.3 | $T', 1$ | $y$ | |
| Gauss/Normal | | II.1.3 | $1$ | $y^2$ | |
| Log-Normal | | II.1.3 | | $y^2$ | Change of variable $y \to \log y$ |
| Frechet/Weibull | $y^{\beta-1}e^{-\lambda y^{\beta}}$ | II.2.2 | $T'$ | $y$ | |
| Stretched exponential | $e^{-\lambda y^{\beta}}$ | II.2.2 | $1$ | $y$ | Gauss with $\beta = 2$ |
| Symmetric Levy | $e^{-\lambda \lvert s\rvert^{\beta}}$ (Fourier domain) | II.2.2 | $1$ | $\lvert y\rvert$ | $\beta < 2$; Gauss ($\beta = 2$), Cauchy ($\beta = 1$); ref. [3] |
| Pareto type I | | II.2.3 | $T', 1$ | $y$ | |
| Log-Frechet | | II.3.2 | $T'$ | $y$ | Also from Frechet: $y \to \log y$, $m_y = y^{-1}T'(y)$ |
| ?? | $e^{-\lambda(\log y)^{\beta}}$ | II.3.2 | $1$ | $y$ | Also stretched exponential with $f_y = \log y$ |
| Log-Pareto type I | $y^{-1}(\log y)^{-\gamma-1}$ | II.3.3 | $T'$ | $y$ | Log-gamma; Pareto I: $y \to \log y$, $m_y = y^{-1}$ |
| ?? | $(\log y)^{-\gamma}$ | II.3.3 | $1$ | $y$ | Also from Pareto I with $f_y = \log y$ |
| Pareto type II | | III.2.3 | $1$ | $y$ | Lomax |
| Generalized Student's | $(c_1+y^2)^{-\gamma}$ | III.2.3 | $1$ | $y^2$ | Pearson type VII, Kappa; includes Cauchy |
| ?? | $(\log(c_1+y))^{-\gamma}$ | III.3.3 | $1$ | $y$ | $c_2 = 0$; also Pareto I with $f_y = \log(c_1+y)$ |
| Gamma | $y^{-\gamma}e^{-c_1\gamma y}$ | III.2.3 | $1$ | $y$ | Pearson type III, includes chi-square |
| Generalized gamma | | III.2.3 | $1$ | $y^{k}$ | Chi with $k = 2$ and $c_1\gamma = 1/2$ |
| Beta | $(c_2-y)^{-\gamma}(y-c_1)^{-b\gamma}$ | IV.1.3 | $1$ | $y$ | Pearson type I; log-linear-log on $(c_1, c_2)$ |
| Beta prime/F | $y^{-b\gamma}(1+y)^{(b+1)\gamma}$ | IV.1.3 | $1$ | $y/(1+y)$ | Pearson type VI, $y > 0$ |
| Gamma variant | $(c_1+y)^{-b\gamma}e^{-c_2\gamma y}$ | IV.2.3 | $1$ | $y$ | Linear-log-linear pattern as $y$ rises from zero |

TABLE V. Some common probability distributions. The column T.L.C gives the table, line, and column of the underlying
form presented in the earlier tables of abstract distributions. For example, II.1.2 refers to Table II, first line, second column.
The measurement adjustment is given as either $m_y = 1$ for direct scales, or $m_y = T'$ for extreme values as in Eq. (14), along
with any consequences from a change of variable such as $y \to \log y$. Cases in which the same structural form arises for either
$m_y = T'$ or $m_y = 1$ are shown as $T', 1$, without adjusting parameters for trivial differences. The value of $f_y$ gives the reduction
of data to sufficient summary form. Direct values $y$, possibly corrected by displacement from a central location, $y - \mu$, are
shown here as $y$ without correction. Squared deviations $(y-\mu)^2$ from a central location are shown here as $y^2$. See refs. mmj
for listing of distributions. Many additional forms can be generated by varying the measurement function. In the first column,
the question marks denote a distribution for which we did not find a commonly used name.



APPENDICES 

Appendix A: On the association between 
measurement functions and classes of scale 
transformations 

If the transformation $f_y \to G(f_y)$ is an invariance of a
measurement function $T$, it is clear that repeated applications of
$G$, expressed as $G\circ G$, $G\circ G\circ G$, ..., are also
invariances of $T$. It is the larger group of invariances that we wish
to identify with the measurement scale that defines $T$, and not only
a single transformation. To simplify notation in this Appendix, we use
$f_y = y$. The same analysis applies to $f_y$.

In general, making a unique association between a transformation $G$
and a measurement function $T$ is inconvenient for finite
transformations, because $G$ combines a magnitude and a direction of
deformation. The magnitude is added under compositions
$G\circ G\ldots$, while the direction remains invariant. As we will
derive below, the relevant measure of the magnitude of a
transformation as in Eq. (5) will be $\sim\log b$, and the relevant
measure of direction will be $a/(b-1)$. To isolate the direction of
$G$ that may be associated with a measurement function $T$, we work
with infinitesimal rather than finite affine transformations.

Infinitesimal transformations are constructed from Eq. (5) in the text
by writing $a = \epsilon\alpha$, $(b-1) = \epsilon\beta$, and then
taking $\epsilon \to 0$ for fixed $\alpha$ and $\beta$. An
infinitesimal transformation $G^{\epsilon}$ then satisfies Eq. (5) in
the form

$$T[G^{\epsilon}(y)] = T(y) + \epsilon\,[\alpha + \beta T(y)]. \tag{A1}$$

$G^{\epsilon}$ itself must therefore also be infinitesimally different
from the identity, and must have the form

$$G^{\epsilon}(y) = y + \epsilon\,v(y), \tag{A2}$$

for some function $v(y)$.

We introduce a quantity $\hat{v}$ called the generator of the
deformation, such that the operator $e^{\epsilon\hat{v}}$ generates
the infinitesimal transformations Eqs. (A1, A2), and such that finite
transformations $G$ or affine transformations Eq. (5) are produced by
the exponential operation of $\hat{v}$ with non-infinitesimal
$\epsilon$. Compounding a function corresponds to addition of
parameters $\epsilon$, as may be checked from the power-series
definition of $e^{\epsilon\hat{v}}$ within its radius of convergence.

We define a representation of the generator $\hat{v}$ as an explicit
differential operator that produces the correct transformation on the
argument $y$ or $T(y)$, as appropriate. The two representations of the
generators are related as

$$\left[1+\epsilon\,v(y)\frac{d}{dy}\right]T(y) = T[y+\epsilon v(y)] = \left[1+\epsilon\,(\alpha+\beta T)\frac{d}{dT}\right]T. \tag{A3}$$

From the requirement that the two expressions produce the same result,
we may assign the representations

$$\hat{v} \leftrightarrow v(y)\frac{d}{dy} = \frac{d}{dw}, \qquad \hat{v} \leftrightarrow (\alpha+\beta T)\frac{d}{dT}, \tag{A4}$$

for some function $w(y) = \int^{y} dy'/v(y')$.

Regarding $T$ as a function of argument $w$ rather than $y$, and
setting equal the two coefficients of $\epsilon$ in Eq. (A3), we
obtain a relation between any function $w(y)$, coefficients $\alpha$
and $\beta$, and the function $T$ in the form

$$\frac{dT}{dw} = \alpha + \beta T. \tag{A5}$$

This is rearranged to produce Eq. (10).
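
The statement that compounding corresponds to addition of the parameter $\epsilon$ can be illustrated numerically. The sketch below is an illustration only, with assumed values for `alpha`, `beta`, `eps`, and the starting value `T0`. It applies the infinitesimal step of Eq. (A1) many times with step size $\epsilon/n$ and compares the result with the standard solution of the linear equation (A5) over a parameter interval $\epsilon$, namely $(\alpha/\beta)(e^{\epsilon\beta}-1) + e^{\epsilon\beta}T$.

```python
import math

def compound_infinitesimal(T0, alpha, beta, eps, n):
    """Apply T -> T + (eps/n) * (alpha + beta*T) n times (Eq. A1, step eps/n)."""
    T = T0
    for _ in range(n):
        T += (eps / n) * (alpha + beta * T)
    return T

def finite_affine(T0, alpha, beta, eps):
    """Solution of dT/dw = alpha + beta*T after a parameter interval eps."""
    growth = math.exp(eps * beta)
    return (alpha / beta) * (growth - 1.0) + growth * T0

T0, alpha, beta, eps = 1.5, 0.7, -0.4, 2.0  # illustrative values
approx = compound_infinitesimal(T0, alpha, beta, eps, n=200000)
exact = finite_affine(T0, alpha, beta, eps)

# Composing the finite map with itself doubles the parameter eps.
twice = finite_affine(finite_affine(T0, alpha, beta, eps), alpha, beta, eps)
doubled = finite_affine(T0, alpha, beta, 2.0 * eps)
```

In this parameterization the affine coefficients are $b = e^{\epsilon\beta}$ and $a = (\alpha/\beta)(b-1)$, so $\log b$ grows additively under composition while $a/(b-1) = \alpha/\beta$ stays fixed.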

From the solutions to Eq. (A5), we may readily check that the action
of the transformation $e^{\epsilon\hat{v}}$ for arbitrary $\epsilon$
(not necessarily small) is

$$T[G(y)] = e^{\epsilon\hat{v}}\,T(y) = \frac{\alpha}{\beta}\left(e^{\epsilon\beta}-1\right) + e^{\epsilon\beta}\,T(y), \tag{A6}$$

from which we recover expressions for the coefficients $a$ and $b$ in
Eq. (5). Under composition $G \to G\circ G$, the parameter
$\epsilon \to 2\epsilon$. The composition rules for $a$ and $b$ under
composition of $G$ may be worked out easily, but depending on the
function $w(y)$, the direct composition of finite transformations $G$
on $y$ may be quite complicated.



Appendix B: Information measures for cumulative
distributions

The presence of the measure $m_y$ in the probability density function
in Eq. complicates the discussion of measurement invariance, because
in the general case $m_y$ is not required to obey any prescribed
transformation when $f_y \to G(f_y)$. In general, $y$ need not even be
a numerical index, whereas $T(f_y)$ is necessarily numerical because
it is proportional to an information measure $-\log(p_y/m_y)$.

The class of cases in which the measurement function, $T$, completely
controls the properties of $p_y$ are those in which measurement
constrains the cumulative probability distribution function rather
than the probability density function. For these cases $m_y$ is not
independent, but is given in terms of $T$ and $f_y$, as we now show.

Relative entropy is ordinarily defined for the probability density.
However, if we set

$$m_y = \frac{dT(f_y)}{dy}, \tag{B1}$$

then $m_y$ becomes a Lebesgue measure on $y$ with respect to the
increment $dT$. The probability density from Eq. (fT^ becomes

$$p_y \propto \frac{dT}{dy}\,e^{-\lambda T(f_y)}. \tag{B2}$$

Eq. (B2) defines the relation between a probability density and its
cumulative distribution, meaning that under a suitable ordering of
$y$, we may take $e^{-\lambda T(f_y)}$ to be the cumulative
distribution.
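
As a concrete check of this relation, the sketch below assumes the illustrative measurement function $T = y^{\beta}$ with $f_y = y$ (the Weibull-type form) and arbitrary parameter values. With $y$ ordered from large to small, the mass of the density $\lambda\,(dT/dy)\,e^{-\lambda T}$ above a point $y_0$ should equal $e^{-\lambda T(y_0)}$.

```python
import math

def T(y, beta):
    """Illustrative measurement function T(f_y) = y**beta with f_y = y."""
    return y ** beta

def dT_dy(y, beta):
    return beta * y ** (beta - 1.0)

def density(y, lam, beta):
    """p_y = lam * (dT/dy) * exp(-lam * T), i.e. Eq. (B2) with its normalization."""
    return lam * dT_dy(y, beta) * math.exp(-lam * T(y, beta))

def upper_tail_mass(y0, lam, beta, y_max=50.0, n=200000):
    """Trapezoid-rule integral of the density from y0 to y_max."""
    step = (y_max - y0) / n
    total = 0.5 * (density(y0, lam, beta) + density(y_max, lam, beta))
    total += sum(density(y0 + i * step, lam, beta) for i in range(1, n))
    return total * step

lam, beta, y0 = 0.5, 1.7, 0.8  # illustrative values
numeric = upper_tail_mass(y0, lam, beta)
closed_form = math.exp(-lam * T(y0, beta))
```

The cutoff `y_max` is an assumption of the sketch; the neglected tail beyond it is far below the stated tolerance for these parameter values.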

With this choice of measure, the relative entropy $\mathcal{E}$ from
Eq. ^ becomes

$$\mathcal{E} = -\int dy\,p_y\log\frac{p_y}{m_y} = -\int dy\,\frac{dT}{dy}\left(\frac{p_y}{dT/dy}\right)\log\left(\frac{p_y}{dT/dy}\right) = -\int dT\,p_T\log p_T, \tag{B3}$$

in which $p_T$ is the probability density defined on the variable $T$.
Since the maximum-entropy solution is always exponential in $T$, the
relative entropy of Eq. (B3) is effectively an information function
for the cumulative distribution.

An application in which constraints under aggregation apply by
construction to the cumulative distribution is the computation of
extreme-value statistics [13]. The cumulative probability distribution
for the maximizer or minimizer of a sample of $n$ realizations of a
random variable is the product of $n$ factors of the cumulative
distribution for a single realization.
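
The product rule for the maximum can be made concrete with the Gumbel form listed in Table V. The sketch below is an illustration under assumptions (the standardized cumulative distribution $\exp(-e^{-y})$ and an arbitrary sample size `n = 25`): raising the cumulative distribution to the $n$-th power returns the same functional form with a shifted location, which is the kind of invariance under aggregation referred to here.

```python
import math

def gumbel_cdf(y):
    """Cumulative distribution exp(-exp(-y)) of the standardized Gumbel form."""
    return math.exp(-math.exp(-y))

def cdf_of_maximum(cdf, y, n):
    """CDF of the largest of n independent draws: product of n single-draw CDFs."""
    return cdf(y) ** n

n = 25  # illustrative sample size
points = [-1.0, 0.0, 1.5, 4.0]
lhs = [cdf_of_maximum(gumbel_cdf, y, n) for y in points]
rhs = [gumbel_cdf(y - math.log(n)) for y in points]  # same form, shifted by log n
```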

It was also noted in ref. that the relative entropy may be evaluated
on the characteristic function (Fourier or Laplace transform) of a
distribution, and that the maximum-entropy solutions in the
transformed domain are the Levy stable distributions. The
characteristic function at frequency argument $k = 0$ always takes
value unity. Therefore it, like a cumulative distribution, has a
reference normalization of unity, and indeed, the symmetric
Levy-stable distributions [21] correspond in form to the Weibull
family of extreme value distributions. Both are obtained within our
classification for $m_y$ defined by Eq. (B1), for suitable reductions
$f_y$.



Appendix C: The Morris Natural Exponential
Families in relation to entropy-maximizing
distributions

1. Symmetry-based approaches to select or to
classify probability distributions

Many systems, since Pearson's, for either selecting or classifying
probability distributions, have been based on symmetry groups, as our
method is. (Pearson's system may be seen as one based on the analytic
structure of the log-probability, a criterion that we will return to
consider in a moment.) The systems differ in generality, depending on
the space in which the symmetry group acts, and depending on whether
it constrains a single distribution or a family. Two methods based on
symmetry (ours and that of Carl Morris, described below) have
interpretations in terms of scale invariance of observables. Both
systems collect probability distributions into families, whose
members differ only by a scale factor. A third approach (known as
Objective Bayesian methods) applies symmetry to the underlying measure
space which, as we note in Appendix B, may be very different from the
space of observed magnitudes. This approach is concerned not directly
with families of distributions, but with the particular distribution
defined by a reference measure. We will briefly summarize the overlaps
and differences of these methods.

Objective Bayesian methods, initiated by Jeffreys [l^ but given the
interpretation of objectivity largely by Jaynes, recognize that the
reference measure $m_y$ in a relative entropy (beyond being needed to
make logarithms well-defined and independent of change of variables)
may reflect information about measurement scales. By ensuring that the
reference measure is consistent with known symmetries of the
phenomenon under study (which are not generally expressed within
particular sample observations), Objective Bayesian methods seek to
systematize the entire maximum-entropy procedure. This use of the
reference measure is consistent with our treatment of measurement,
though by itself it is more limited, as we discuss in Ref. Q, and it
may also be misleading in cases [22]. In the context of the present
discussion, the most important limitation of Objective Bayesian
methods is that they select properties of a single distribution $m_y$,
rather than properties of a family.

Our approach broadens the class of symmetries that can be considered,
beyond those available to Objective Bayesian methods, as discussed in
Sec. 7 of Ref. [4]. Through the measurement function, it relates a
potentially nonlinear contour of deformations of measured magnitudes
to a linear transformation within the affine group that exists for
general maximum-entropy problems. We have embedded distributions
within a hierarchy by using the two-parameter freedom of the affine
group to provide a range of responses of information to the change in
the scale of measurement.



2. The Morris classification of distributions in
relation to maximum entropy


In a pair of papers in 1982 and 1983 [13, [11, Carl 
Morris proposed another classification system for prob- 
ability distributions, which overlaps both with Objec- 
tive Bayesian methods and with our approach. Like our 
method, Morris's concerns families of probability distri- 
butions generated by a change in constraint or measure- 
ment scale. Like all of the approaches we have mentioned, 
Morris's system uses relative entropy in a conventional 
maximization framework. That system differs from ours 
in using only a linear constraint on what Morris terms the 
natural observation, and obtaining nonlinear dependence 
on that constraint through a second boundary condition 
placed on entropy. 

The Morris system blends interesting elements of Pear- 
son's restrictions on analytic structure, our use of sym- 
metry, and the Objective Bayesian concern with the ref- 
erence measure, as follows: Morris considers distribu- 
tion families that are invariant under offset and rescaling 
of the natural observation, which Morris labels X, and 
which is analogous to using a coordinate system that is 
always linear in our $f_y$. His classification therefore does not
invoke any explicit representation of the symmetries
inherent in differing measurement systems. In
order to encompass distributions that are not simply ex- 
ponential in the values x (taken by the observation X), 
he instead restricts the form of the reference measure in 
a relative entropy, analogous to our my. Unlike Objec- 
tive Bayesian methods, however, this restriction does not 
come from the direct action of a symmetry on the refer- 
ence measure, but rather from the form of the relative 
entropy across the family of distributions produced by 
scale change. 

The classification system of Ref. [13] derives from the
cumulant-generating function and the relation between 
the variance and the mean as the parameter in this gen- 
erating function is shifted. The distributions that define 
the cumulant-generating function constitute what Mor- 
ris calls natural exponential families (NEF), and the de- 
pendence of variance on mean within these families is 
restricted in his system to be an exact quadratic polyno- 
mial. The resulting subclass of distributions within the 
NEF class is termed QVF (for quadratic variance function).
The mean-variance relation that defines the NEF-QVF
distributions is preserved under offset and rescaling
of the natural observation, and under convolution.
Therefore, the distributions in this class would be ex- 
pected to arise frequently in problems of aggregation. We 
show in this appendix that the QVF condition is equiv- 
alent to the requirement that the relative entropy over 
a family has the form (up to analytic continuation) of 
a Kullback-Leibler divergence. The analytic continua- 
tion is determined by the roots of the quadratic variance 
polynomial, and these roots in turn have a relation to the 
roots for log-probability in the Pearson system. 

The distributions selected by Morris's criterion are ei- 
ther bounded, or have exponential or faster decay in their 
tails. We show that, when they are classified according 
to their analytic structure, they are in fact either inte- 
rior members or degenerate limits of only two families 
of distributions: One family of continuous-valued dis- 
tributions is associated with complex-conjugate roots of 
the variance function, and a complex analytic continu- 
ation of the Kullback-Leibler form for relative entropy. 
A second family of discrete-valued distributions is asso- 
ciated with real-valued roots, and real-valued continu- 
ations of the Kullback-Leibler relative entropy. In this 
sense, the Morris classification shows that six important 
distribution families are in fact selected by a single set 
of invariances — of these, the offset and scale invariances 
are instances of our linear measurement rescaling. These 
selected families are therefore very commonly observed, 
but also rather tightly restricted. Preservation of a func- 
tional class under convolution is similar to the criterion 
leading to the extreme-value or Levy distributions, as we
have discussed in the main text, and is therefore one of 
many forms of measurement invariance that may be con- 
sidered. 

Here we will re-formulate the Morris criterion and its 
solutions within a standard framework of maximum en- 
tropy. We will show that the role of the reference mea- 
sure in a relative entropy is equivalent to that of a second 
observed quantity, which will generally be linearly inde- 
pendent of the natural observation X. Scale change of 
the natural observation defines what is known as an ex- 
pansion path, which consists of the distributions within 
an exponential family. The second observed quantity, as- 
sociated with the reference measure, is given a gradient 
constraint rather than a value constraint. It is through 
the interaction of these two constraints that nonlinear 
dependence on x is obtained in the log-probability. At 
the end of the Appendix we mention a relation between 
the Morris system and the Pearson system based on the 
log-probability. When the Morris QVF criterion is ex- 
pressed as a formal constraint on entropy, this form is 
imposed on the leading terms of log-probability by the 
large-deviations property of cumulant-generating func- 
tions. 



3. Definition of the natural exponential families 

The NEF distributions are defined in relation to the 
cumulant-generating function, which arises naturally in 
the method of maximum entropy. The most direct way 
to re-formulate the original presentation of Refs. [13, [HI 
in terms of maximum entropy is to assume a (Shannon- 
type) entropy in a higher-dimensional state space than 
the univariate space of the natural observation X. The 
high-dimensional states have non-uniform density when 
they are projected onto the one dimension in which the 
probability distribution varies. Once a Lagrangian is de- 
fined from this initial re-formulation, it becomes easy to 
re-interpret the density of states as a reference measure in 
a relative entropy (and the latter interpretation is more 
general). The cumulant-generating function is then the 
Legendre transform of this relative entropy. We develop 
the two interpretations in order, to connect the deriva- 
tions of Refs. [13, [13 systematically to the formulation 
we use in the main text. 



a. The Stieltjes measure as a density of states 

Ref. [13] introduces a Stieltjes measure $dF(x)$, and an initial
probability distribution $P_0$ associated with this measure, defined
by

$$P_0(X \in A) = \int_A dF(x), \tag{C1}$$

for an arbitrary set $A$ in the range of $x$. With respect to this
original probability measure, Morris introduces the exponential
families in terms of a probability mass function

$$\phi(x \mid \theta) = e^{x\theta-\psi(\theta)}, \tag{C2}$$

which multiplicatively weights the original measure $dF(x)$. The
normalizing constant $\psi(\theta)$ in Eq. (C2) is the
cumulant-generating function, given by

$$e^{\psi(\theta)} = \int dF(x)\,e^{x\theta}. \tag{C3}$$

The NEF distributions are the normalized versions of the distributions
that define the cumulant-generating function. In the original
Stieltjes measure, the probabilities defined from these distributions
are

$$P(X \in A) = \int_A dF(x)\,e^{x\theta-\psi(\theta)}. \tag{C4}$$

With respect to the measure $dF(x)$, we may obtain the solutions (C2)
by extremizing the Lagrangian

$$\mathcal{L} = -\int dF(x)\,\phi(x)\log\phi(x) + \theta\left[\int dF(x)\,\phi(x)\,x - \mu\right] - \kappa\left[\int dF(x)\,\phi(x) - 1\right] \tag{C5}$$

over its natural argument $\phi(x)$ and the Lagrange multipliers
$\theta$ and $\kappa$. Here we have replaced the notation $\lambda$
from the text with Morris's $\theta$ for ease of reference. From its
role as a normalization constant, the multiplier $\kappa$ must
evaluate to the cumulant-generating function $\psi(\theta)$ on
solutions.
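
The construction in Eqs. (C1) to (C4) can be sketched numerically for a discrete measure. The example below is an assumption-laden illustration: the support points and weights of `dF` are invented for the purpose, and derivatives of $\psi$ are taken by finite differences. It checks that the tilted weights are normalized and that the first and second derivatives of $\psi(\theta)$ reproduce the mean and variance of the tilted distribution.

```python
import math

# Hypothetical discrete Stieltjes measure dF: point masses on x = 0, 1, 2, 3.
support = [0.0, 1.0, 2.0, 3.0]
weights = [0.1, 0.4, 0.3, 0.2]  # sums to 1, so psi(0) = 0

def psi(theta):
    """Cumulant-generating function: log of the sum of dF(x) * exp(x*theta)."""
    return math.log(sum(w * math.exp(x * theta) for x, w in zip(support, weights)))

def nef_weights(theta):
    """Tilted probabilities dF(x) * exp(x*theta - psi(theta)), as in Eq. (C4)."""
    p = psi(theta)
    return [w * math.exp(x * theta - p) for x, w in zip(support, weights)]

def mean_and_variance(theta):
    probs = nef_weights(theta)
    mean = sum(x * p for x, p in zip(support, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(support, probs))
    return mean, var

theta, h = 0.6, 1e-4
mean, var = mean_and_variance(theta)
dpsi = (psi(theta + h) - psi(theta - h)) / (2 * h)                   # first derivative
d2psi = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h ** 2  # second derivative
```

Varying `theta` while holding `dF` fixed traces out one natural exponential family.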

Lagrangian problems of this form arise frequently in 
systems where a high-dimensional state space is pro- 
jected down onto a single coordinate x, which is the only 
observed property on which distributions depend. The 
Lagrangian (C5) effectively treats $\phi(x)$ as the ratio of a
probability density to a uniform reference measure on the
original high-dimensional space. The Stieltjes measure
$dF(x)$ is the marginal projection of the original measure
onto the coordinate $x$, and the derivative $dF/dx$ is known
as the density of states. ($dF(x)$ need not be smooth,
and $dF/dx$ may readily be a non-continuous distribution,
such as a sum of Dirac $\delta$-functions, representing a
discrete rather than continuous probability density.)

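The entropy in this formulation appears as a standard Shannon entropy
(equivalent to a relative entropy with a uniform reference measure) in
the high-dimensional coordinates. It evaluates to the Legendre
transform of the cumulant-generating function,

$$S(\mu(\theta)) \equiv -\int dF(x)\,p(x \mid \theta)\log p(x \mid \theta) = \psi(\theta) - \theta\,\mu(\theta), \tag{C6}$$

in which $\mu(\theta)$ is the mean value in the distribution
$p(x \mid \theta)$. $\theta$ is the natural argument of $\psi$, while
$\mu$ from the variational problem is the natural argument of $S$.
Therefore it is usual to write this Legendre transform pair as

$$\begin{aligned} \psi(\theta) &= \left.\left[S(\mu) - \mu\frac{dS}{d\mu}\right]\right|_{\mu(\theta)}, \\ S(\mu) &= \left.\left[\psi(\theta) - \theta\frac{d\psi}{d\theta}\right]\right|_{\theta(\mu)}. \end{aligned} \tag{C7}$$

In the second line, $\theta(\mu)$ is the inverse function to
$\mu(\theta)$. (In statistical mechanics, where $-\theta$ is the
inverse temperature if $X$ is the energy, $\psi$ arises as $\theta$
times the Helmholtz Free Energy.)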

We note several properties of these functions that will be useful in
understanding Morris's NEF-QVF families. When $\theta = 0$ no
correction to the normalization is needed in $P(X \in A)$, so we have
immediately that $\psi(0) = 0$ as well. If we denote by
$\mu_0 \equiv \mu(0)$, then it follows that $S(\mu_0) = 0$ also. The
definition of the Legendre transform pair (C7) gives the important
dual relations

$$\frac{d\psi(\theta)}{d\theta} = \mu, \qquad \frac{dS(\mu)}{d\mu} = -\theta. \tag{C8}$$

It follows that $dS/d\mu|_{\mu_0} = 0$. With these two constants of
integration, $S(\mu)$ will be completely specified by the form of its
second derivative.
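
The dual relations and the two constants of integration can be illustrated with a closed-form example. The sketch below assumes a Poisson base measure with mean `MU0` (one of the families named later in this appendix; the value 2.0 is arbitrary), for which $\psi(\theta) = \mu_0(e^{\theta}-1)$. It forms $S(\mu) = \psi(\theta(\mu)) - \theta(\mu)\,\mu$ and checks by finite differences that $dS/d\mu = -\theta$, and that both $S$ and its first derivative vanish at $\mu_0$.

```python
import math

MU0 = 2.0  # mean at theta = 0; illustrative value

def psi(theta):
    """Cumulant-generating function of a Poisson base measure with mean MU0."""
    return MU0 * (math.exp(theta) - 1.0)

def mu_of_theta(theta):
    return MU0 * math.exp(theta)  # d psi / d theta

def theta_of_mu(mu):
    return math.log(mu / MU0)  # inverse function of mu_of_theta

def entropy(mu):
    """Legendre transform S(mu) = psi(theta(mu)) - theta(mu) * mu."""
    th = theta_of_mu(mu)
    return psi(th) - th * mu

mu, h = 3.1, 1e-5
dS_dmu = (entropy(mu + h) - entropy(mu - h)) / (2 * h)
dS_dmu_at_mu0 = (entropy(MU0 + h) - entropy(MU0 - h)) / (2 * h)
```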



b. Replacing the density of states with a reference measure
in relative entropy



For the univariate distributions, whether continuous or discrete, we
may define a shorthand for Eq. (C4) by identifying the probability
density function on $x$ as

$$p_{x \mid \theta} \equiv \frac{dF}{dx}\,e^{x\theta-\psi(\theta)}. \tag{C9}$$

The Lagrangian (C5) becomes, under this change of variable,

$$\mathcal{L} = -\int dx\,p_x\log\left(\frac{p_x}{dF/dx}\right) + \theta\left[\int dx\,p_x\,x - \mu\right] - \kappa\left[\int dx\,p_x - 1\right]. \tag{C10}$$

The constraint terms are unchanged, but the entropy is now manifestly
a relative entropy for the density $p_x$ with reference measure
$dF/dx$.



c. Arriving at nonlinear expansion paths through mixed 
boundary conditions 

The Morris families, like the Pearson families and like our classes
based on measurement, include distributions that are nonlinear in the
values $x$ taken by the natural observation $X$. Both Morris's
families and ours are based on affine transformation, so that their
distributions form what are known as expansion paths. (This term is
used also in economics for constrained maximization problems, in which
$\mu$ generally describes a budget constraint. The original usage, in
statistics, is mentioned in Ref. [17].) Whereas we achieve nonlinear
dependence on $x$ by considering the symmetries of measurement, the
Morris system achieves nonlinearity through the use of mixed boundary
conditions, when this system is described in terms of entropy
maximization. By using two constraints (one to specify the family and
the other to fix a point on the expansion path), Morris is able to
apply a fixed-gradient condition with respect to one constraint, and a
fixed-value condition for the natural observation. Because we specify
distributions from the affine transformation of a single observable,
we must incorporate nonlinearities into the measurement function
itself.

Here, we derive the NEF criterion by converting the relative entropy
to a form in which the reference measure may be interpreted as a
second observable. The ubiquitous use, in statistical physics and
thermodynamics, of cumulant-generating functions and their Legendre
transforms under mixed boundary conditions, provides intuition from
familiar systems for the meaning of the resulting expansion paths. In
the next section we derive the way in which the QVF condition of
Morris then places constraints on the reference measure, which plays
the role of the secondary observation.

The Lagrangian (C10) is an instance of a more general class of maximum
entropy problems in which the relative entropy has uniform measure
(and therefore has the form of a Shannon entropy), and the reference
measure appears as an additional constraint term,

$$\mathcal{L} = -\int dx\,p_x\log p_x + \lambda\int dx\,p_x\log\left(\frac{dF}{dx}\right) + \theta\left[\int dx\,p_x\,x - \mu\right] - \kappa\left[\int dx\,p_x - 1\right]. \tag{C11}$$

Here a variable $\lambda$ has been added as a parameter in the
variational problem, parallel to the parameter $\mu$ in the constraint
on $\int dx\,p_x\,x$. When $\lambda = 1$, Eq. (C11) reduces to
Eq. (C10), and the choice of reference measure does not matter because
it cancels in the two logarithms. For more general $\lambda$, a
uniform reference measure is explicitly required to make the
logarithms well-defined. The distribution solving Eq. (C11) is

$$p_x \propto \left(\frac{dF}{dx}\right)^{\lambda}e^{x\theta}. \tag{C12}$$

The Shannon entropy of Eq. (C11) is maximized subject to mixed
constraints, which may be seen as follows. The entropy with two
constraint terms is a function of two arguments $S(\mu, \ell)$, where
$\ell \equiv \langle\log(dF/dx)\rangle$ at the given values of
$\lambda$ and $\mu$. Then $\lambda = -\partial S/\partial\ell$, just as
$\theta = -\partial S/\partial\mu$ from Eq. (C8). Because $\mu$ is an
argument to the entropy, whereas $\lambda$ is a gradient, problems of
this sort resemble solutions to differential equations under mixed
Dirichlet and Neumann boundary conditions.

The set of distributions (C12), as $\lambda$ is held fixed and $\mu$
is varied, makes up the expansion path for the entropy with respect to
constraint $\int dx\,p_x\,x$. The natural exponential families are the
distributions on this expansion path, given a gradient constraint with
respect to the observable $\int dx\,p_x\log(dF/dx)$.



4. The subset of natural exponential families with 
quadratic variation 

Any reference measure may in principle form the 
basis for an expansion path with mixed constraints. 
In contrast to Objective Bayesian methods, in which 
log (dF/dx) is constrained by symmetry, the Morris sys- 
tem constrains reference measures by restricting the form 
of the variance function — equivalent to restricting the 
form of the entropy — along the nonlinear expansion path. 



a. The QVF family and Kullback-Leibler entropies

The definition of the cumulant-generating function is that, not only
does $d\psi/d\theta = \mu$, but $d^2\psi/d\theta^2$ is the variance of
the observation $X$. Morris defines its relation to the mean $\mu$ as
a variance function $V(\mu)$. The quadratic variance relation is the
dependence

$$\frac{d\mu}{d\theta} = v_0 + v_1\mu + v_2\mu^2. \tag{C13}$$

By definition of $\theta(\mu)$ and $\mu(\theta)$ as inverse functions,
it follows that the variance is also the (geometric and algebraic)
inverse of the curvature of the relative entropy. We differentiate the
second line in Eq. (C7) twice and substitute Eq. (C13), to produce

$$\frac{d^2S}{d\mu^2} = -\frac{d\theta}{d\mu} = \frac{-1}{v_0 + v_1\mu + v_2\mu^2}. \tag{C14}$$

Because we have first and second constants of integration from the
relations following Eq. (C7), Eq. (C14) has an unambiguous integral.
To assign meaning to this integral, however, and in the process to
expose a relation between the Morris and Pearson approaches to
classification, we first factor the variance function into an overall
normalization and the roots of the polynomial. Write

$$v_0 + v_1\mu + v_2\mu^2 = v_2\,(\mu-\mu_1)(\mu-\mu_2), \tag{C15}$$

with the solutions

$$\mu_{1,2} = \frac{-v_1 \mp \sqrt{v_1^2 - 4v_0v_2}}{2v_2}. \tag{C16}$$

Then the integral of Eq. (C14) becomes

$$v_2 S = \frac{\mu_2-\mu}{\mu_2-\mu_1}\log\left(\frac{\mu_2-\mu}{\mu_2-\mu_0}\right) + \frac{\mu-\mu_1}{\mu_2-\mu_1}\log\left(\frac{\mu-\mu_1}{\mu_0-\mu_1}\right). \tag{C17}$$

If we denote by $\varphi \equiv (\mu-\mu_1)/(\mu_2-\mu_1)$ the analytic
continuation of a partition of the unit interval, we may write
Eq. (C17) as

$$\begin{aligned} v_2 S &= (1-\varphi)\log\left(\frac{1-\varphi}{1-\varphi_0}\right) + \varphi\log\left(\frac{\varphi}{\varphi_0}\right) \\ &= D(\varphi\,\|\,\varphi_0). \end{aligned} \tag{C18}$$

In the second line we use $\varphi$ to stand for the "probability
distribution" $(\varphi, 1-\varphi)$ on two atoms, and likewise for
$\varphi_0$. $D(\varphi\,\|\,\varphi_0)$ is the Kullback-Leibler
divergence of $\varphi$ from the distribution $\varphi_0$ defined by
the equilibrium mean $\mu_0$ and the variance function. The standard
form for the curvature of a Kullback-Leibler divergence $S$ may be
written

$$v_2\,(\mu_2-\mu_1)^2\,\frac{d^2S}{d\mu^2} = \frac{1}{\varphi\,(1-\varphi)}. \tag{C19}$$

A slight variation on the formula (C17), making use of forms (C16) for
the roots, the Legendre transform relations (C7), and the constants of
integration, reads

$$\begin{aligned} 2v_2\psi(\theta) + v_1\theta &= \log\left[\frac{(\mu_2-\mu)(\mu-\mu_1)}{(\mu_2-\mu_0)(\mu_0-\mu_1)}\right] \\ &= \log\left[\frac{\varphi\,(1-\varphi)}{\varphi_0\,(1-\varphi_0)}\right]. \end{aligned} \tag{C20}$$

This integral relation between the cumulant-generating function and
the variance function appears as Eq. 3.7 in Ref. [13].
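
Both the Kullback-Leibler form of the entropy, $v_2 S = D(\varphi\,\|\,\varphi_0)$, and the integral relation $2v_2\psi + v_1\theta = \log[\varphi(1-\varphi)/(\varphi_0(1-\varphi_0))]$ can be checked in the simplest case. The sketch below assumes a Bernoulli family (base measure $1/2$ on $x = 0$ and $x = 1$), for which $V(\mu) = \mu(1-\mu)$, so $v_0 = 0$, $v_1 = 1$, $v_2 = -1$, the roots are $0$ and $1$, and $\mu_0 = 1/2$. The choice of family and the value of `theta` are illustrative.

```python
import math

# Bernoulli natural exponential family: base measure 1/2 on x = 0 and x = 1.
# Variance function V(mu) = mu * (1 - mu): v0 = 0, v1 = 1, v2 = -1,
# roots mu1 = 0 and mu2 = 1, equilibrium mean mu0 = 1/2.
V1, V2 = 1.0, -1.0
MU1, MU2, MU0 = 0.0, 1.0, 0.5

def psi(theta):
    """Cumulant-generating function of the base measure."""
    return math.log(0.5 * (1.0 + math.exp(theta)))

def mu_of_theta(theta):
    return math.exp(theta) / (1.0 + math.exp(theta))

def entropy(theta):
    """S = psi - theta * mu, the Legendre transform of psi."""
    return psi(theta) - theta * mu_of_theta(theta)

def kl_two_atoms(phi, phi0):
    """Kullback-Leibler divergence of (phi, 1-phi) from (phi0, 1-phi0)."""
    return (phi * math.log(phi / phi0)
            + (1.0 - phi) * math.log((1.0 - phi) / (1.0 - phi0)))

theta = 0.9
mu = mu_of_theta(theta)
phi = (mu - MU1) / (MU2 - MU1)
phi0 = (MU0 - MU1) / (MU2 - MU1)

lhs_entropy = V2 * entropy(theta)
rhs_entropy = kl_two_atoms(phi, phi0)
lhs_integral = 2.0 * V2 * psi(theta) + V1 * theta
rhs_integral = math.log(phi * (1.0 - phi) / (phi0 * (1.0 - phi0)))
```

Here the roots are real and $\varphi$ is an ordinary probability; the complex-root case treated next requires the analytic continuation described in the text.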



b. Two fundamental NEF-QVF families, and various limits 

Working in terms of the signs and magnitudes of the coefficients
$v_0$, $v_1$, $v_2$, Morris identifies exactly six inequivalent
natural exponential families with quadratic variance functions. Three
are continuous (Gaussian, gamma, and hyperbolic-secant probability
density functions), and three are discrete (binomial,
negative-binomial, and Poisson probability mass functions), up to
offset and scaling of the natural observation $X$. We will see here
that, working in terms of the analytic structure of the entropy (C17),
and a simple classification of the roots $\mu_{1,2}$, we may identify
two main classes, corresponding to the continuous and discrete
distributions, and various limiting forms of these, which complete
Morris's families.

The quantity that distinguishes the continuous from the discrete
NEF-QVF families is the discriminant
$d = v_1^2 - 4v_0v_2 = v_2^2(\mu_2-\mu_1)^2$ (which is unchanged by
offset of $X$). In the case where $d > 0$, the variance function (C13)
has two real roots, while if $d < 0$, it has two complex-conjugate
roots. By choice of offset and scale, we may obtain Morris's canonical
families by making the complex-conjugate roots purely imaginary when
$d < 0$, or by taking one of the two real roots to lie at the origin
if $d > 0$.
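
The role of the discriminant can be sketched in a few lines. The coefficient triples below are illustrative choices of variance functions (labelled only by their formulas), and the function names are not from the text. The sketch computes $d$ and the two roots, and checks the identity $d = v_2^2(\mu_2-\mu_1)^2$ in both the real and the complex case.

```python
import cmath

def variance_roots(v0, v1, v2):
    """Discriminant and roots of V(mu) = v0 + v1*mu + v2*mu^2 (requires v2 != 0)."""
    d = v1 * v1 - 4.0 * v0 * v2
    sqrt_d = cmath.sqrt(d)
    return d, (-v1 - sqrt_d) / (2.0 * v2), (-v1 + sqrt_d) / (2.0 * v2)

def root_type(v0, v1, v2):
    d, _, _ = variance_roots(v0, v1, v2)
    if d > 0:
        return "two real roots"
    if d < 0:
        return "complex-conjugate roots"
    return "repeated root"

# Illustrative variance functions, keyed by formula.
examples = {
    "1 + mu^2": (1.0, 0.0, 1.0),
    "mu - mu^2/4": (0.0, 1.0, -0.25),
    "mu + mu^2/3": (0.0, 1.0, 1.0 / 3.0),
    "mu^2/2": (0.0, 0.0, 0.5),
}
labels = {name: root_type(*coeffs) for name, coeffs in examples.items()}

def discriminant_gap(v0, v1, v2):
    """Difference between d and v2^2 * (mu2 - mu1)^2, which should vanish."""
    d, m1, m2 = variance_roots(v0, v1, v2)
    return abs(d - (v2 * v2 * (m2 - m1) ** 2).real)
```

The case with a repeated root ($d = 0$) corresponds to a degenerate limit rather than to either generic class.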



We begin with the imaginary roots, which select the continuous-valued
NEF-QVF distributions. The canonical form for these is obtained when
$v_1 = 0$, and $v_0, v_2 > 0$. We may then define

$$\mu_{1,2} = \mp i\Delta, \tag{C21}$$

with $\Delta = \sqrt{v_0/v_2}$.

The relative entropy, about a distribution $p_{x \mid 0}$ in the
NEF-QVF family with mean $\mu_0$, must have the form

$$v_2 S = \frac{1}{2}\log\left(\frac{\Delta^2+\mu^2}{\Delta^2+\mu_0^2}\right) - \frac{\mu}{\Delta}\left[\tan^{-1}\left(\frac{\mu}{\Delta}\right) - \tan^{-1}\left(\frac{\mu_0}{\Delta}\right)\right]. \tag{C22}$$



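The relation of $\theta$ to $\mu$ and $\mu_0$ is

$$
v_2 \Lambda \theta = \tan^{-1} \left( \frac{\mu}{\Lambda} \right) - \tan^{-1} \left( \frac{\mu_0}{\Lambda} \right) .
\tag{C23}
$$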



If we choose a background in which $\mu_0 = 0$ (by freedom to offset $X$), it follows that we may write the cumulant-generating function as

$$
v_2 \psi = \frac{1}{2} \log \left( 1 + \tan^2 \left( v_2 \Lambda \theta \right) \right) .
\tag{C24}
$$



The canonical normalization for this family of distributions is given by $v_2 = 1$. One may check directly that they are produced by the family of hyperbolic-cosecant density functions

$$
p_{x|0} = \frac{1}{\Lambda} \, \frac{1}{e^{\pi x / 2\Lambda} + e^{-\pi x / 2\Lambda}} .
\tag{C25}
$$

(The proof is by contour integral. Check that

$$
\cos (\Lambda \theta) \, e^{\psi(\theta)}
= \frac{1}{\pi} \int_0^\infty du \,
\frac{(iu)^{\tilde\theta} + (-iu)^{\tilde\theta}}{1 + u^2}
= 1 ,
\tag{C26}
$$

with integration variable $u = e^{\pi x / 2\Lambda}$ and shifted parameter $\tilde\theta = 2 \Lambda \theta / \pi$. The contour that avoids branch cuts, in the log-transform to variables $u$, closes in the negative-imaginary half-plane, encircling the pole $u = -i$.) The distributions at $\Lambda = 1$ are the canonical densities given in Eq. 4.2 of Ref. [17].
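The claim that this density reproduces the cumulant-generating function (C24) can also be checked numerically, without the contour integral. The following is a minimal sketch, assuming the density of Eq. (C25) has the form $1 / [\Lambda (e^{\pi x/2\Lambda} + e^{-\pi x/2\Lambda})]$ and that $v_2 = 1$ and $\mu_0 = 0$; the function names and the trapezoid-rule settings are illustrative choices, not part of the text.

```python
import math


def density(x, lam):
    """Density of Eq. (C25), written as 1 / (2 * lam * cosh(pi * x / (2 * lam)))."""
    return 1.0 / (2.0 * lam * math.cosh(math.pi * x / (2.0 * lam)))


def tilted_moments(lam, theta, half_width=60.0, steps=24000):
    """Trapezoid-rule estimates of (Z, mean, variance) under the weight exp(theta * x).

    Z estimates exp(psi(theta)); mean and variance refer to the tilted
    distribution density * exp(theta * x) / Z. The integral converges only
    for |lam * theta| < pi / 2.
    """
    a = -half_width * lam
    h = 2.0 * half_width * lam / steps
    z = m1 = m2 = 0.0
    for k in range(steps + 1):
        x = a + k * h
        w = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        f = w * density(x, lam) * math.exp(theta * x)
        z += f
        m1 += f * x
        m2 += f * x * x
    z, m1, m2 = z * h, m1 * h, m2 * h
    mean = m1 / z
    return z, mean, m2 / z - mean * mean
```

Under these assumptions one expects $Z = 1/\cos(\Lambda\theta)$, a tilted mean $\mu = \Lambda \tan(\Lambda\theta)$, and a tilted variance $\mu^2 + \Lambda^2$, which is the quadratic variance function with $v_1 = 0$ and $v_2 = 1$.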

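It is straightforward to check that, as $\Lambda \to \infty$, the relative entropy (C22) reduces to the form

$$
S \to - \frac{(\mu - \mu_0)^2}{2 v_0}
\tag{C27}
$$

for a Gaussian distribution

$$
p_{x|0} = \frac{1}{\sqrt{2 \pi v_0}} \, e^{-(x - \mu_0)^2 / 2 v_0}
\tag{C28}
$$

with arbitrary mean. We have used $v_2 \Lambda^2 = v_0$ as $v_2 \to 0$.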

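In the other limit, as $\Lambda \to 0$, it is convenient to take $v_2 = 1/q = 1/\mu_0$, in which case we recover the relative entropy

$$
S \to \mu_0 - \mu + \mu_0 \log \left( \frac{\mu}{\mu_0} \right)
\tag{C29}
$$

appropriate to the standard gamma distribution

$$
p_{x|0} = \frac{1}{\Gamma(q)} \, x^{q - 1} e^{-x} .
\tag{C30}
$$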



Two of the three continuous-valued NEF-QVF fami- 
lies, therefore, are degenerate limits of the hyperbolic- 
cosecant distribution, which represents the generic case. 
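Both limits can be checked numerically. The sketch below assumes that the relative entropy (C22) has the closed form $v_2 S = \tfrac{1}{2}\log[(\Lambda^2 + \mu^2)/(\Lambda^2 + \mu_0^2)] - (\mu/\Lambda)[\tan^{-1}(\mu/\Lambda) - \tan^{-1}(\mu_0/\Lambda)]$, and compares it with the Gaussian form (C27) at large $\Lambda$ and with the gamma form (C29) at small $\Lambda$. The function names are illustrative.

```python
import math


def relative_entropy(mu, mu0, lam, v2):
    """Assumed closed form of Eq. (C22) for V(mu) = v2 * (mu**2 + lam**2)."""
    log_term = 0.5 * math.log((lam ** 2 + mu ** 2) / (lam ** 2 + mu0 ** 2))
    angle = math.atan(mu / lam) - math.atan(mu0 / lam)
    return (log_term - (mu / lam) * angle) / v2


def gaussian_entropy(mu, mu0, v0):
    """Gaussian limit, Eq. (C27)."""
    return -((mu - mu0) ** 2) / (2.0 * v0)


def gamma_entropy(mu, mu0):
    """Gamma limit, Eq. (C29), valid for mu, mu0 > 0."""
    return mu0 - mu + mu0 * math.log(mu / mu0)
```

For the Gaussian limit one holds $v_0 = v_2 \Lambda^2$ fixed while $\Lambda$ grows, for example `relative_entropy(1.5, 0.5, 1e3, 2.0 / 1e6)` against `gaussian_entropy(1.5, 0.5, 2.0)`. For the gamma limit one sets $v_2 = 1/\mu_0$ and takes $\Lambda$ small, for example `relative_entropy(5.0, 3.0, 1e-4, 1.0 / 3.0)` against `gamma_entropy(5.0, 3.0)`.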

The discrete-valued families, following when the variance function has real roots, may be handled in similar fashion. We choose canonical forms by offsetting $x$ to set $\mu_1 = 0$, and attain this in the variance function by taking $v_0 \to 0$. The canonical scale for $x$ is then given by taking $v_1 = 1$.

For the discrete distributions, there are two "interior" families of solutions (the binomial and negative binomial), and one limiting family (the Poisson) that may be reached from either of them. The root $\mu_2 = -v_1 / v_2$ in all cases. To obtain the binomial distribution on $N$ samples with mean $\mu_0 = pN$,

$$
p_{x|0} = \binom{N}{x} \, p^x (1 - p)^{N - x} ,
\tag{C31}
$$



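we take $\mu_2 = N$, corresponding to $v_2 = -1/N$. For this distribution only, the range is finite, $0 \le x \le N$. The relative entropy takes the standard form of a Kullback-Leibler divergence without extending the definition of $y$ by analytic continuation:

$$
\begin{aligned}
S &= -(N - \mu) \log \left( \frac{1 - \mu/N}{1 - p} \right) - \mu \log \left( \frac{\mu}{Np} \right) \\
&= -N \left[ \left( 1 - \frac{\mu}{N} \right) \log \left( \frac{1 - \mu/N}{1 - \mu_0/N} \right) + \frac{\mu}{N} \log \left( \frac{\mu}{\mu_0} \right) \right] \\
&= -N D(\mu/N \,\|\, p) .
\end{aligned}
\tag{C32}
$$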



The negative binomial distribution is immediately obtained by taking $N \to -N$ in the second line of Eq. (C32) while holding $\mu_0$ fixed. The corresponding distribution is

$$
p_{x|0} = \binom{N + x - 1}{x} \, p^x (1 - p)^N ,
\tag{C33}
$$

with $p = \mu_0 / (N + \mu_0)$. This is the other "interior" solution, with $\mu_2 = -N$ and therefore $v_2 = 1/N$.

The Poisson distribution is the limit of either of the previous two forms as $v_2 \to 0$, so $\mu_2 \to \pm\infty$, at $\mu_0$ fixed. The distribution is

$$
p_{x|0} = \frac{e^{-\mu_0} \mu_0^x}{x!} ,
\tag{C34}
$$

and the entropy becomes

$$
S \to \mu - \mu_0 - \mu \log \left( \frac{\mu}{\mu_0} \right) .
\tag{C35}
$$



For either of the negative binomial or the Poisson, the range of $x$ is unbounded, $x \ge 0$.
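The three discrete entropies can be cross-checked against a direct numerical integration of the variance function. The sketch below assumes the sign conventions $dS/d\mu = -\theta$ and $d\theta/d\mu = 1/V(\mu)$ with $S(\mu_0) = 0$ (inferred from the form of Eq. (C35), not quoted from the text), so that $S(\mu) = \int_{\mu_0}^{\mu} (m - \mu)/V(m)\, dm$. The closed forms used for comparison are the Kullback-Leibler form for the binomial, its continuation $N \to -N$ for the negative binomial, and Eq. (C35) for the Poisson. All function names are illustrative.

```python
import math


def entropy_from_variance(variance, mu0, mu, steps=2000):
    """Simpson-rule estimate of S(mu) = integral from mu0 to mu of (m - mu) / V(m) dm.

    `variance` is a callable V(m); `steps` must be even.
    """
    h = (mu - mu0) / steps

    def f(m):
        return (m - mu) / variance(m)

    total = f(mu0) + f(mu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(mu0 + k * h)
    return total * h / 3.0


def binomial_entropy(mu, mu0, n):
    """-N * D(mu/N || p) with p = mu0 / N."""
    p, y = mu0 / n, mu / n
    return -n * (y * math.log(y / p) + (1 - y) * math.log((1 - y) / (1 - p)))


def negative_binomial_entropy(mu, mu0, n):
    """Binomial form continued to N -> -N at fixed mu0."""
    return (n + mu) * math.log((n + mu) / (n + mu0)) - mu * math.log(mu / mu0)


def poisson_entropy(mu, mu0):
    """Eq. (C35)."""
    return mu - mu0 - mu * math.log(mu / mu0)
```

With $N = 20$, $\mu_0 = 6$, and $\mu = 9$, the canonical variance functions are `lambda m: m * (1 - m / 20)` for the binomial, `lambda m: m * (1 + m / 20)` for the negative binomial, and `lambda m: m` for the Poisson. Taking $N$ large in either interior family should approach the Poisson value.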

The relative entropy expressions (C29, C35) for the gamma and the Poisson distributions are the same functional form, under exchange of the reference mean $\mu_0$ with the distribution mean $\mu$. Their respective distributions are likewise interchanged under exchange of $x$ with $\mu_0$, except that in the gamma case (C30), a further shift $\mu_0 \to \mu_0 - 1$ must be performed as well. We will return to integer shifts of this form in the next section.

(We note that the association of imaginary roots with continuous-valued distributions, and of real roots with discrete-valued distributions, is a defining structural feature of quantum-mechanical distributions for particles with finite temperature but continuous time-dependence [16]. This is one of many interesting connections to the NEF-QVF families that it will not be possible to explore in this publication.)



5. Relations to the Pearson system through 
large-deviations formulae 

It is instructive to compare the forms for the entropies of the distributions in the NEF-QVF families to the logarithms of the probability densities or mass functions themselves. By virtue of the entropy as a large-deviations measure [24], it and the log-probability will coincide to leading exponential order for sufficiently sharply peaked distributions.

The entropy is defined in the Morris system as a second integral of a rational function with two poles. The $\log p_{x|0}$ is defined in the Pearson system similarly, except that it is a first integral of a rational function with two poles [11]. The difference between these two degrees of integration leads to non-coincidence of the two families, though in many parameter limits they overlap.

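We begin by comparing the continuous distributions. For the Gaussian, the two functions are identical up to a constant,

$$
\begin{aligned}
\log p_{x|0} &= - \frac{(x - \mu_0)^2}{2 v_0} - \frac{1}{2} \log (2 \pi v_0) , \\
S &= - \frac{(\mu - \mu_0)^2}{2 v_0} .
\end{aligned}
\tag{C36}
$$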



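For the standard gamma with mean $\mu_0 = q$,

$$
\begin{aligned}
\log p_{x|0} &= -x + (q - 1) \log x - \log \Gamma(q) \\
&\approx (q - 1) - x + (q - 1) \log \left( \frac{x}{q - 1} \right) , \\
S &= q - \mu + q \log \left( \frac{\mu}{q} \right) ,
\end{aligned}
\tag{C37}
$$

in which the $\approx$ in the second line keeps the first two terms in Stirling's formula for $\log \Gamma(q)$. The functions are identical in form but differ by an offset $q \to q - 1$.

The hyperbolic cosecant density shows the least similarity in its domain of small argument. However, at small $\Lambda$, where it is sharply peaked, and at fixed $x$ or $\mu$, the following expansion becomes informative:

$$
\begin{aligned}
\log p_{x|0} &= - \frac{\pi |x|}{2 \Lambda} - \log \Lambda - \log \left( 1 + e^{-\pi |x| / \Lambda} \right) , \\
S &= - \frac{\mu}{\Lambda} \tan^{-1} \left( \frac{\mu}{\Lambda} \right) - \log \Lambda + \frac{1}{2} \log \left( \mu^2 + \Lambda^2 \right) .
\end{aligned}
\tag{C38}
$$

For $|\mu| / \Lambda \gg 1$, $\tan^{-1} (\mu / \Lambda) \to \operatorname{sgn}(\mu) \, \pi / 2$, giving the same two leading terms for $x$ and for $\mu$.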
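The gamma comparison of Eq. (C37) is easy to verify numerically. The sketch below is an illustration under the assumption that "the first two terms in Stirling's formula" means $\log \Gamma(q) \approx (q - 1) \log (q - 1) - (q - 1)$; the function names are not from the text. It evaluates the exact log-density, its two-term Stirling approximation, and the entropy with the offset $q \to q - 1$.

```python
import math


def gamma_log_density(x, q):
    """Exact log of the standard gamma density x**(q-1) * exp(-x) / Gamma(q)."""
    return -x + (q - 1.0) * math.log(x) - math.lgamma(q)


def gamma_log_density_stirling(x, q):
    """Log-density with log Gamma(q) replaced by (q-1)*log(q-1) - (q-1)."""
    return (q - 1.0) - x + (q - 1.0) * math.log(x / (q - 1.0))


def gamma_entropy(mu, q):
    """Relative entropy of Eq. (C29) with reference mean mu0 = q."""
    return q - mu + q * math.log(mu / q)
```

With these definitions, `gamma_log_density_stirling(x, q)` coincides with `gamma_entropy(x, q - 1)`, which is the offset noted above. The remaining gap between the exact and the approximated log-density is dominated by the next Stirling term, $-\tfrac{1}{2} \log [2 \pi (q - 1)]$, which does not grow with $x$.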

The discrete distributions behave similarly. For the binomial,

$$
\begin{aligned}
\log p_{x|0} &\approx -N D(x/N \,\|\, p) , \\
S &= -N D(\mu/N \,\|\, p) ,
\end{aligned}
\tag{C39}
$$



and for the Poisson

$$
\begin{aligned}
\log p_{x|0} &\approx x - \mu_0 - x \log \left( \frac{x}{\mu_0} \right) , \\
S &= \mu - \mu_0 - \mu \log \left( \frac{\mu}{\mu_0} \right) ,
\end{aligned}
\tag{C40}
$$

where again $\approx$ stands for the first two terms in Stirling's formula for factorials. Within these approximations, the two functions are identical. The negative binomial differs by terms at $O(x/N)$, but within a similar Stirling approximation, it may be written



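$$
\begin{aligned}
\log p_{x|0} &\approx (N + x) \log \left( \frac{N + x}{N + \mu_0} \right) - x \log \left( \frac{x}{\mu_0} \right) \\
&\quad + (N + x) \log \left( 1 - \frac{1}{N + x} \right) - N \log \left( 1 - \frac{1}{N} \right) - \log \left[ 1 + \frac{x}{N - 1} \right] , \\
S &= (N + \mu) \log \left( \frac{N + \mu}{N + \mu_0} \right) - \mu \log \left( \frac{\mu}{\mu_0} \right) .
\end{aligned}
\tag{C41}
$$

The leading terms, corresponding to the analytic continuation of the Kullback-Leibler form, again coincide. The only differences arise from shifts $N \to N - 1$ in a subset of terms, similar to the shift $q \to q - 1$ in Eq. (C37).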



The equivalence of $\log p_{x|0}$ and $S$ to leading exponential order is a consequence of the large-deviations property [13] for these distributions. The cumulant-generating function is the integral of the shifted density,

$$
e^{\psi(\theta)} = \int dx \, p_{x|0} \, e^{\theta x} .
\tag{C42}
$$

The exponential of the entropy cancels the absolute magnitude of the inserted weight factor $e^{\theta x}$ near the maximum of the shifted distribution, because for sharply peaked distributions the maximum is near $x \approx \mu$.



(C43) 



(This property of the entropy is equivalent to that of functions known as effective actions, as developed in Ref. [23].) $S(\mu)$ is therefore approximately equal to $\log p_{x|0}$, evaluated at $x \approx \mu$. Thus, the Morris restriction to quadratic variance functions implies that $\log p_{x|0}$, to leading order, will equal the analytic continuation of a function of Kullback-Leibler form.
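The phrase "to leading exponential order" can be made concrete with the Poisson case of Eq. (C40). In the sketch below (illustrative names; the scaling of both means by a common factor is an assumption chosen to sharpen the peak), the ratio of the exact $\log p_{x|0}$ at $x = \mu$ to $S(\mu)$ approaches one as the scale grows, because the neglected Stirling term grows only logarithmically while $S$ grows linearly.

```python
import math


def poisson_log_pmf(x, mu0):
    """Exact log of the Poisson mass function exp(-mu0) * mu0**x / x!."""
    return -mu0 + x * math.log(mu0) - math.lgamma(x + 1.0)


def poisson_entropy(mu, mu0):
    """Relative entropy of Eq. (C35)."""
    return mu - mu0 - mu * math.log(mu / mu0)


def log_ratio(scale, mu0=100.0, mu=150.0):
    """Ratio log p / S at x = mu after multiplying both means by `scale`."""
    x, ref = scale * mu, scale * mu0
    return poisson_log_pmf(x, ref) / poisson_entropy(x, ref)
```

Comparing `log_ratio(1.0)` with `log_ratio(1000.0)` shows the ratio moving toward one as the distribution becomes more sharply peaked relative to its mean.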



[1] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. Wiley, Hoboken, NJ, 2nd edition.

[2] Embrechts, P., Klüppelberg, C., and Mikosch, T. (1997). Modeling Extremal Events: For Insurance and Finance. Springer Verlag, Heidelberg.

[3] Frank, S. A. (2009). The common patterns of nature. J. Evol. Biol., 22:1563-1585.

[4] Frank, S. A. and Smith, D. E. (2010). Measurement invariance, entropy, and probability. Entropy, 12:289-303.

[5] Hand, D. (2004). Measurement Theory and Practice. Arnold, London.

[6] Jaynes, E. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227-241.

[7] Jaynes, E. T. (1957a). Information theory and statistical mechanics. Phys. Rev., 106(4):620-630.

[8] Jaynes, E. T. (1957b). Information theory and statistical mechanics. II. Phys. Rev., 108(2):171-190.

[9] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press, New York.

[10] Jeffreys, H. (1957). Scientific Inference. Cambridge Univ. Press, London, 2nd edition.

[11] Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distributions, volume 1. Wiley, New York, 2nd edition.

[12] Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions, volume 2. Wiley, New York, 2nd edition.

[13] Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Wiley, New York.

[14] Kotz, S. and Nadarajah, S. (2000). Extreme Value Distributions: Theory and Applications. World Scientific, Singapore.

[15] Luce, R. D. and Narens, L. (2008). Measurement, theory of. In Durlauf, S. N. and Blume, L. E., editors, The New Palgrave Dictionary of Economics. Palgrave Macmillan, Basingstoke.

[16] Mahan, G. D. (2000). Many Particle Physics. Springer, New York, 3rd edition.

[17] Morris, C. N. (1982). Natural exponential families with quadratic variance functions. Annals of Statistics, 10(1):65-80.

[18] Morris, C. N. (1983). Natural exponential families with quadratic variance functions: statistical theory. Annals of Statistics, 11:515-529.

[19] Morris, C. N. and Lock, K. F. (2009). Unifying the named natural exponential families and their relatives. American Statistician, 63(3):247-253.

[20] Narens, L. and Luce, R. D. (2008). Meaningfulness and invariance. In Durlauf, S. N. and Blume, L. E., editors, The New Palgrave Dictionary of Economics. Palgrave Macmillan, Basingstoke.

[21] Sato, K. (2001). Basic results on Lévy processes. In Barndorff-Nielsen, O. E., Mikosch, T., and Resnick, S. I., editors, Lévy Processes: Theory and Applications, pages 3-37, Boston. Birkhäuser.

[22] Seidenfeld, T. (1979). Why I am not an objective Bayesian: some reflections prompted by Rosenkrantz. Theory and Decision, 11:413-440.

[23] Smith, E. (2010). Large-deviation principles, stochastic effective actions, path entropies, and the structure and meaning of thermodynamic descriptions. Rev. Mod. Phys., (Submitted).

[24] Touchette, H. (2009). The large deviation approach to statistical mechanics. Physics Reports, 478:1-69.

[25] Weyl, H. (1952). Symmetry. Princeton University Press, Princeton.


